summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-08-23block: cache the partition index in struct block_deviceChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23btrfs: index check-integrity state hash by a dev_tChristoph Hellwig
We won't have the struct block_device available in the bio soon, so switch to the numerical dev_t instead of the block_device pointer for looking up the check-integrity state. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()Ronnie Sahlberg
Add checking for the path component length and verify it is <= the maximum that the server advertizes via FileFsAttributeInformation. With this patch cifs.ko will now return ENAMETOOLONG instead of ENOENT when users to access an overlong path. To test this, try to cd into a (non-existing) directory on a CIFS share that has a too long name: cd /mnt/aaaaaaaaaaaaaaa... and it now should show a good error message from the shell: bash: cd: /mnt/aaaaaaaaaaaaaaaa...aaaaaa: File name too long rh bz 1153996 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Cc: <stable@vger.kernel.org>
2017-08-23cifs: Fix df output for users with quota limitsSachin Prabhu
The df for a SMB2 share triggers a GetInfo call for FS_FULL_SIZE_INFORMATION. The values returned are used to populate struct statfs. The problem is that none of the information returned by the call contains the total blocks available on the filesystem. Instead we use the blocks available to the user ie. quota limitation when filling out statfs.f_blocks. The information returned does contain Actual free units on the filesystem and is used to populate statfs.f_bfree. For users with quota enabled, it can lead to situations where the total free space reported is more than the total blocks on the system ending up with df reports like the following # df -h /mnt/a Filesystem Size Used Avail Use% Mounted on //192.168.22.10/a 2.5G -2.3G 2.5G - /mnt/a To fix this problem, we instead populate both statfs.f_bfree with the same value as statfs.f_bavail ie. CallerAvailableAllocationUnits. This is similar to what is done already in the code for cifs and df now reports the quota information for the user used to mount the share. # df --si /mnt/a Filesystem Size Used Avail Use% Mounted on //192.168.22.10/a 2.7G 101M 2.6G 4% /mnt/a Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Pierguido Lambri <plambri@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Cc: <stable@vger.kernel.org>
2017-08-23isofs: Delete an unnecessary variable initialisation in isofs_read_inode()Markus Elfring
The local variable "bh" will be set to an appropriate pointer a bit later. Thus omit the explicit initialisation at the beginning. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-23isofs: Adjust four checks for null pointersMarkus Elfring
The script “checkpatch.pl” pointed information out like the following. Comparison to NULL could be written !... Thus fix the affected source code places. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-22Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a clang build regression and an potential xattr corruption bug" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add missing xattr hash update ext4: fix clang build regression
2017-08-22xfs: stop searching for free slots in an inode chunk when there are noneCarlos Maiolino
In a filesystem without finobt, the Space manager selects an AG to alloc a new inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk. When the new inode is in the same AG as its parent, the btree will be searched starting on the parent's record, and then retried from the top if no slot is available beyond the parent's record. To exit this loop though, xfs_dialloc_ag_inobt() relies on the fact that the btree must have a free slot available, once its callers relied on the agi->freecount when deciding how/where to allocate this new inode. In the case when the agi->freecount is corrupted, showing available inodes in an AG, when in fact there is none, this becomes an infinite loop. Add a way to stop the loop when a free slot is not found in the btree, making the function to fall into the whole AG scan which will then, be able to detect the corruption and shut the filesystem down. As pointed by Brian, this might impact performance, giving the fact we don't reset the search distance anymore when we reach the end of the tree, giving it fewer tries before falling back to the whole AG search, but it will only affect searches that start within 10 records to the end of the tree. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: add log recovery tracepoint for head/tailBrian Foster
Torn write detection and tail overwrite detection can shift the log head and tail respectively in the event of CRC mismatch or corruption errors. Add a high-level log recovery tracepoint to dump the final log head/tail and make those values easily attainable in debug/diagnostic situations. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: handle -EFSCORRUPTED during head/tail verificationBrian Foster
Torn write and tail overwrite detection both trigger only on -EFSBADCRC errors. While this is the most likely failure scenario for each condition, -EFSCORRUPTED is still possible in certain cases depending on what ends up on disk when a torn write or partial tail overwrite occurs. For example, an invalid log record h_len can lead to an -EFSCORRUPTED error when running the log recovery CRC pass. Therefore, update log head and tail verification to trigger the associated head/tail fixups in the event of -EFSCORRUPTED errors along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned from xlog_do_recovery_pass() before rhead_blk is initialized if the first record encountered happens to be corrupted. This leads to an incorrect 'first_bad' return value. Initialize rhead_blk earlier in the function to address that problem as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: add log item pinning error injection tagBrian Foster
Add an error injection tag to force log items in the AIL to the pinned state. This option can be used by test infrastructure to induce head behind tail conditions. Specifically, this is intended to be used by xfstests to reproduce log recovery problems after failed/corrupted log writes overwrite the last good tail LSN in the log. When enabled, AIL push attempts see log items in the AIL in the pinned state. This stalls metadata writeback and thus prevents the current tail of the log from moving forward. When disabled, subsequent AIL pushes observe the log items in their appropriate state and filesystem operation continues as normal. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: fix log recovery corruption error due to tail overwriteBrian Foster
If we consider the case where the tail (T) of the log is pinned long enough for the head (H) to push and block behind the tail, we can end up blocked in the following state without enough free space (f) in the log to satisfy a transaction reservation: 0 phys. log N [-------HffT---H'--T'---] The last good record in the log (before H) refers to T. The tail eventually pushes forward (T') leaving more free space in the log for writes to H. At this point, suppose space frees up in the log for the maximum of 8 in-core log buffers to start flushing out to the log. If this pushes the head from H to H', these next writes overwrite the previous tail T. This is safe because the items logged from T to T' have been written back and removed from the AIL. If the next log writes (H -> H') happen to fail and result in partial records in the log, the filesystem shuts down having overwritten T with invalid data. Log recovery correctly locates H on the subsequent mount, but H still refers to the now corrupted tail T. This results in log corruption errors and recovery failure. Since the tail overwrite results from otherwise correct runtime behavior, it is up to log recovery to try and deal with this situation. Update log recovery tail verification to run a CRC pass from the first record past the tail to the head. This facilitates error detection at T and moves the recovery tail to the first good record past H' (similar to truncating the head on torn write detection). If corruption is detected beyond the range possibly affected by the max number of iclogs, the log is legitimately corrupted and log recovery failure is expected. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: always verify the log tail during recoveryBrian Foster
Log tail verification currently only occurs when torn writes are detected at the head of the log. This was introduced because a change in the head block due to torn writes can lead to a change in the tail block (each log record header references the current tail) and the tail block should be verified before log recovery proceeds. Tail corruption is possible outside of torn write scenarios, however. For example, partial log writes can be detected and cleared during the initial head/tail block discovery process. If the partial write coincides with a tail overwrite, the log tail is corrupted and recovery fails. To facilitate correct handling of log tail overwites, update log recovery to always perform tail verification. This is necessary to detect potential tail overwrite conditions when torn writes may not have occurred. This changes normal (i.e., no torn writes) recovery behavior slightly to detect and return CRC related errors near the tail before actual recovery starts. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: fix recovery failure when log record header wraps log endBrian Foster
The high-level log recovery algorithm consists of two loops that walk the physical log and process log records from the tail to the head. The first loop handles the case where the tail is beyond the head and processes records up to the end of the physical log. The subsequent loop processes records from the beginning of the physical log to the head. Because log records can wrap around the end of the physical log, the first loop mentioned above must handle this case appropriately. Records are processed from in-core buffers, which means that this algorithm must split the reads of such records into two partial I/Os: 1.) from the beginning of the record to the end of the log and 2.) from the beginning of the log to the end of the record. This is further complicated by the fact that the log record header and log record data are read into independent buffers. The current handling of each buffer correctly splits the reads when either the header or data starts before the end of the log and wraps around the end. The data read does not correctly handle the case where the prior header read wrapped or ends on the physical log end boundary. blk_no is incremented to or beyond the log end after the header read to point to the record data, but the split data read logic triggers, attempts to read from an invalid log block and ultimately causes log recovery to fail. This can be reproduced fairly reliably via xfstests tests generic/047 and generic/388 with large iclog sizes (256k) and small (10M) logs. If the record header read has pushed beyond the end of the physical log, the subsequent data read is actually contiguous. Update the data read logic to detect the case where blk_no has wrapped, mod it against the log size to read from the correct address and issue one contiguous read for the log data buffer. The log record is processed as normal from the buffer(s), the loop exits after the current iteration and the subsequent loop picks up with the first new record after the start of the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: Properly retry failed inode items in case of error during buffer writebackCarlos Maiolino
When a buffer has been failed during writeback, the inode items into it are kept flush locked, and are never resubmitted due the flush lock, so, if any buffer fails to be written, the items in AIL are never written to disk and never unlocked. This causes unmount operation to hang due these items flush locked in AIL, but this also causes the items in AIL to never be written back, even when the IO device comes back to normal. I've been testing this patch with a DM-thin device, creating a filesystem larger than the real device. When writing enough data to fill the DM-thin device, XFS receives ENOSPC errors from the device, and keep spinning on xfsaild (when 'retry forever' configuration is set). At this point, the filesystem can not be unmounted because of the flush locked items in AIL, but worse, the items in AIL are never retried at all (once xfs_inode_item_push() will skip the items that are flush locked), even if the underlying DM-thin device is expanded to the proper size. This patch fixes both cases, retrying any item that has been failed previously, using the infra-structure provided by the previous patch. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: Add infrastructure needed for error propagation during buffer IO failureCarlos Maiolino
With the current code, XFS never re-submit a failed buffer for IO, because the failed item in the buffer is kept in the flush locked state forever. To be able to resubmit an log item for IO, we need a way to mark an item as failed, if, for any reason the buffer which the item belonged to failed during writeback. Add a new log item callback to be used after an IO completion failure and make the needed clean ups. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: toggle readonly state around xfs_log_mount_finishEric Sandeen
When we do log recovery on a readonly mount, unlinked inode processing does not happen due to the readonly checks in xfs_inactive(), which are trying to prevent any I/O on a readonly mount. This is misguided - we do I/O on readonly mounts all the time, for consistency; for example, log recovery. So do the same RDONLY flag twiddling around xfs_log_mount_finish() as we do around xfs_log_mount(), for the same reason. This all cries out for a big rework but for now this is a simple fix to an obvious problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: write unmount record for ro mountsEric Sandeen
There are dueling comments in the xfs code about intent for log writes when unmounting a readonly filesystem. In xfs_mountfs, we see the intent: /* * Now the log is fully replayed, we can transition to full read-only * mode for read-only mounts. This will sync all the metadata and clean * the log so that the recovery we just performed does not have to be * replayed again on the next mount. */ and it calls xfs_quiesce_attr(), but by the time we get to xfs_log_unmount_write(), it returns early for a RDONLY mount: * Don't write out unmount record on read-only mounts. Because of this, sequential ro mounts of a filesystem with a dirty log will replay the log each time, which seems odd. Fix this by writing an unmount record even for RO mounts, as long as norecovery wasn't specified (don't write a clean log record if a dirty log may still be there!) and the log device is writable. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22btrfs: submit superblock io with REQ_META and REQ_PRIODavid Sterba
The superblock is also metadata of the filesystem so the relevant IO should be tagged as such. We also tag it as high priority, as it's the last block committed for metadata from a given transaction. Any delays would effectively block the whole transaction, also blocking any other operation holding the device_list_mutex. Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-08-21f2fs: introduce discard_granularity sysfs entryChao Yu
Commit d618ebaf0aa8 ("f2fs: enable small discard by default") enables f2fs to issue 4K size discard in real-time discard mode. However, issuing smaller discard may cost more lifetime but releasing less free space in flash device. Since f2fs has ability of separating hot/cold data and garbage collection, we can expect that small-sized invalid region would expand soon with OPU, deletion or garbage collection on valid datas, so it's better to delay or skip issuing smaller size discards, it could help to reduce overmuch consumption of IO bandwidth and lifetime of flash storage. This patch makes f2fs selectng 64K size as its default minimal granularity, and issue discard with the size which is not smaller than minimal granularity. Also it exposes discard granularity as sysfs entry for configuration in different scenario. Jaegeuk Kim: We must issue all the accumulated discard commands when fstrim is called. So, I've added pend_list_tag[] to indicate whether we should issue the commands or not. If tag sets P_ACTIVE or P_TRIM, we have to issue them. P_TRIM is set once at a time, given fstrim trigger. In addition, issue_discard_thread is calling too much due to the number of discard commands remaining in the pending list. I added a timer to control it likewise gc_thread. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: remove unused function overprovision_sectionsYunlong Song
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: check hot_data for roll-forward recoveryJaegeuk Kim
We need to check HOT_DATA to truncate any previous data block when doing roll-forward recovery. Cc: <stable@vger.kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: add tracepoint for f2fs_gcChao Yu
This patch adds tracepoint for f2fs_gc. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: retry to revoke atomic commit in -ENOMEM caseChao Yu
During atomic committing, if we encounter -ENOMEM in revoke path, it's better to give a chance to retry revoking. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: let fill_super handle roll-forward errorsJaegeuk Kim
If we set CP_ERROR_FLAG in roll-forward error, f2fs is no longer to proceed any IOs due to f2fs_cp_error(). But, for example, if some stale data is involved on roll-forward process, we're able to get -ENOENT, getting fs stuck. If we get any error, let fill_super set SBI_NEED_FSCK and try to recover back to stable point. Cc: <stable@vger.kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: merge equivalent flags F2FS_GET_BLOCK_[READ|DIO]Qiuyang Sun
Currently, the two flags F2FS_GET_BLOCK_[READ|DIO] are totally equivalent and can be used interchangably in all scenarios they are involved in. Neither of the flags is referenced in f2fs_map_blocks(), making them both the default case. To remove the ambiguity, this patch merges both flags into F2FS_GET_BLOCK_DEFAULT, and introduces an enum for all distinct flags. Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21f2fs: support journalled quotaChao Yu
This patch supports to enable f2fs to accept quota information through mount option: - {usr,grp,prj}jquota=<quota file path> - jqfmt=<quota type> Then, in ->mount flow, we can recover quota file during log replaying, by this, journelled quota can be supported. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: Fix wrong return values.] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21btrfs: remove unnecessary memory barrier in btrfs_direct_IONikolay Borisov
Commit 38851cc19adb ("Btrfs: implement unlocked dio write") implemented unlocked dio write, allowing multiple dio writers to write to non-overlapping, and non-eof-extending regions. In doing so it also introduced a broken memory barrier. It is broken due to 2 things: 1. Memory barriers _MUST_ always be paired, this is clearly not the case here 2. Checkpatch actually produces a warning if a memory barrier is introduced that doesn't have a comment explaining how it's being paired. Specifically for inode::i_dio_count that's wrapped inside inode_dio_begin, there is no explicit barrier semantics attached, so removing is fine as the atomic is used in common the waiter/wakeup pattern. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extentNikolay Borisov
Currently this function is always called with the object id of the root key of the chunk_tree, which is always BTRFS_CHUNK_TREE_OBJECTID. So let's subsume it straight into the function itself. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extentNikolay Borisov
THe function is always called with chunk_objectid set to BTRFS_FIRST_CHUNK_TREE_OBJECTID. Let's collapse the parameter in the function itself. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: pass fs_info to btrfs_del_root instead of tree_rootJeff Mahoney
btrfs_del_roots always uses the tree_root. Let's pass fs_info instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: add one more sanity check for shared ref typeLiu Bo
Every shared ref has a parent tree block, which can be get from btrfs_extent_inline_ref_offset(). And the tree block must be aligned to the nodesize, so we'd know this inline ref is not valid if this block's bytenr is not aligned to the nodesize, in which case, most likely the ref type has been misused. This adds the above mentioned check and also updates print_extent_item() called by btrfs_print_leaf() to point out the invalid ref while printing the tree structure. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: remove BUG_ON in __add_tree_blockLiu Bo
The BUG_ON() can be triggered when the caller is processing an invalid extent inline ref, e.g. a shared data ref is offered instead of an extent data ref, such that it tries to find a non-existent tree block and then btrfs_search_slot returns 1 for no such item. This replaces the BUG_ON() with a WARN() followed by calling btrfs_print_leaf() to show more details about what's going on and returning -EINVAL to upper callers. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: remove BUG() in add_data_referenceLiu Bo
Now that we have a helper to report invalid value of extent inline ref type, we need to quit gracefully instead of throwing out a kernel panic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: remove BUG() in print_extent_itemLiu Bo
btrfs_print_leaf() is used in btrfs_get_extent_inline_ref_type, so here we really want to print the invalid value of ref type instead of causing a kernel panic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: remove BUG() in btrfs_extent_inline_ref_sizeLiu Bo
Now that btrfs_get_extent_inline_ref_type() can report if type is a valid one and all callers can gracefully deal with that, we don't need to crash here. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: convert to use btrfs_get_extent_inline_ref_typeLiu Bo
Since we have a helper which can do sanity check, this converts all btrfs_extent_inline_ref_type to it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21Btrfs: add a helper to retrive extent inline ref typeLiu Bo
An invalid value of extent inline ref type may be read from a malicious image which may force btrfs to crash. This adds a helper which does sanity check for the ref type, so we can know if it's sane, return he type, otherwise return an error. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minimal tweak const types, causing warnings due to other cleanup patches ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: simplify scrub worker initializationDavid Sterba
Minor simplification, merge calls to one. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: clean up division in scrub_find_csumDavid Sterba
Use proper helpers for 64bit division. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: clean up division in __scrub_mark_bitmapDavid Sterba
Use proper helpers for 64bit division and then cast to narrower type. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: scrub: use bool for flush_all_writesDavid Sterba
flush_all_writes is an atomic but does not use the semantics at all, it's just on/off indicator, we can use bool. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: preserve i_mode if __btrfs_set_acl() failsErnesto A. Fernández
When changing a file's acl mask, btrfs_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by restoring the original mode bits if __btrfs_set_acl fails. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Remove extraneous chunk_objectid variableNikolay Borisov
BTRFS_FIRST_CHUNK_TREE_OBJECTIS id the only objectid being used in the chunk_tree. So remove a variable which is always set to that value and collapse its usage in callees which are passed this variable. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Remove chunk_objectid argument from btrfs_make_block_groupNikolay Borisov
btrfs_make_block_group is always called with chunk_objectid set to BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will change anytime soon, so let's remove the argument and decrease the cognitive load when reading the code path. No functional change Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Remove extra parentheses from condition in copy_items()Matthias Kaehlcke
There is no need for the extra pair of parentheses, remove it. This fixes the following warning when building with clang: fs/btrfs/tree-log.c:3694:10: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if ((i == (nr - 1))) ~~^~~~~~~~~~~ Also remove the unnecessary parentheses around the substraction. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Remove redundant setting of uuid in btrfs_block_headerNikolay Borisov
btrfs_alloc_dev_extent currently unconditionally sets the uuid in the leaf block header the function is working with. This is unnecessary since this operation is peformed by the core btree handling code (splitting a node, allocating a new btree block etc). So let's remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: Do not use data_alloc_cluster in ssd modeHans van Kranenburg
This patch provides a band aid to improve the 'out of the box' behaviour of btrfs for disks that are detected as being an ssd. In a general purpose mixed workload scenario, the current ssd mode causes overallocation of available raw disk space for data, while leaving behind increasing amounts of unused fragmented free space. This situation leads to early ENOSPC problems which are harming user experience and adoption of btrfs as a general purpose filesystem. This patch modifies the data extent allocation behaviour of the ssd mode to make it behave identical to nossd mode. The metadata behaviour and additional ssd_spread option stay untouched so far. Recommendations for future development are to reconsider the current oversimplified nossd / ssd distinction and the broken detection mechanism based on the rotational attribute in sysfs and provide experienced users with a more flexible way to choose allocator behaviour for data and metadata, optimized for certain use cases, while keeping sane 'out of the box' default settings. The internals of the current btrfs code have more potential than what currently gets exposed to the user to choose from. The SSD story... In the first year of btrfs development, around early 2008, btrfs gained a mount option which enables specific functionality for filesystems on solid state devices. The first occurance of this functionality is in commit e18e4809, labeled "Add mount -o ssd, which includes optimizations for seek free storage". The effect on allocating free space for doing (data) writes is to 'cluster' writes together, writing them out in contiguous space, as opposed to a 'tetris' way of putting all separate writes into any free space fragment that fits (which is what the -o nossd behaviour does). A somewhat simplified explanation of what happens is that, when for example, the 'cluster' size is set to 2MiB, when we do some writes, the data allocator will search for a free space block that is 2MiB big, and put the writes in there. The ssd mode itself might allow a 2MiB cluster to be composed of multiple free space extents with some existing data in between, while the additional ssd_spread mount option kills off this option and requires fully free space. The idea behind this is (commit 536ac8ae): "The [...] clusters make it more likely a given IO will completely overwrite the ssd block, so it doesn't have to do an internal rwm cycle."; ssd block meaning nand erase block. So, effectively this means applying a "locality based algorithm" and trying to outsmart the actual ssd. Since then, various changes have been made to the involved code, but the basic idea is still present, and gets activated whenever the ssd mount option is active. This also happens by default, when the rotational flag as seen at /sys/block/<device>/queue/rotational is set to 0. However, there's a number of problems with this approach. First, what the optimization is trying to do is outsmart the ssd by assuming there is a relation between the physical address space of the block device as seen by btrfs and the actual physical storage of the ssd, and then adjusting data placement. However, since the introduction of the Flash Translation Layer (FTL) which is a part of the internal controller of an ssd, these attempts are futile. The use of good quality FTL in consumer ssd products might have been limited in 2008, but this situation has changed drastically soon after that time. Today, even the flash memory in your automatic cat feeding machine or your grandma's wheelchair has a full featured one. Second, the behaviour as described above results in the filesystem being filled up with badly fragmented free space extents because of relatively small pieces of space that are freed up by deletes, but not selected again as part of a 'cluster'. Since the algorithm prefers allocating a new chunk over going back to tetris mode, the end result is a filesystem in which all raw space is allocated, but which is composed of underutilized chunks with a 'shotgun blast' pattern of fragmented free space. Usually, the next problematic thing that happens is the filesystem wanting to allocate new space for metadata, which causes the filesystem to fail in spectacular ways. Third, the default mount options you get for an ssd ('ssd' mode enabled, 'discard' not enabled), in combination with spreading out writes over the full address space and ignoring freed up space leads to worst case behaviour in providing information to the ssd itself, since it will never learn that all the free space left behind is actually free. There are two ways to let an ssd know previously written data does not have to be preserved, which are sending explicit signals using discard or fstrim, or by simply overwriting the space with new data. The worst case behaviour is the btrfs ssd_spread mount option in combination with not having discard enabled. It has a side effect of minimizing the reuse of free space previously written in. Fourth, the rotational flag in /sys/ does not reliably indicate if the device is a locally attached ssd. For example, iSCSI or NBD displays as non-rotational, while a loop device on an ssd shows up as rotational. The combination of the second and third problem effectively means that despite all the good intentions, the btrfs ssd mode reliably causes the ssd hardware and the filesystem structures and performance to be choked to death. The clickbait version of the title of this story would have been "Btrfs ssd optimizations considered harmful for ssds". The current nossd 'tetris' mode (even still without discard) allows a pattern of overwriting much more previously used space, causing many more implicit discards to happen because of the overwrite information the ssd gets. The actual location in the physical address space, as seen from the point of view of btrfs is irrelevant, because the actual writes to the low level flash are reordered anyway thanks to the FTL. Changes made in the code 1. Make ssd mode data allocation identical to tetris mode, like nossd. 2. Adjust and clean up filesystem mount messages so that we can easily identify if a kernel has this patch applied or not, when providing support to end users. Also, make better use of the *_and_info helpers to only trigger messages on actual state changes. Backporting notes Notes for whoever wants to backport this patch to their 4.9 LTS kernel: * First apply commit 951e7966 "btrfs: drop the nossd flag when remounting with -o ssd", or fixup the differences manually. * The rest of the conflicts are because of the fs_info refactoring. So, for example, instead of using fs_info, it's root->fs_info in extent-tree.c Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flushLu Fengqi
Although this bio has no data attached, it will reach this condition (bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state in __btrfsic_submit_bio. So we should still submit it through integrity checker. Otherwise, the integrity checker will throw the following warning when I mount a newly created btrfs filesystem. [10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)! [10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)! Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>