summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2017-04-18btrfs: replace hardcoded value with SEQ_LAST macroEdmund Nadolski
Define the SEQ_LAST macro to replace (u64)-1 in places where said value triggers a special-case ref search behavior. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: provide enumeration for __merge_refs mode argumentEdmund Nadolski
Replace hardcoded numeric values for __merge_refs 'mode' argument with descriptive constants. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: remove unused qgroup members from btrfs_trans_handleDavid Sterba
The members have been effectively unused since "Btrfs: rework qgroup accounting" (fcebe4562dec83b3), there's no substitute for assert_qgroups_uptodate so it's removed as well. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: remove local blocksize variable in reada_find_extentDavid Sterba
The name is misleading and the local variable serves no purpose. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: remove redundant parameter from reada_start_machine_devDavid Sterba
We can read fs_info from dev. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: remove redundant parameter from reada_find_zoneDavid Sterba
We can read fs_info from dev. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: remove redundant parameter from btree_readahead_hookDavid Sterba
We can read fs_info from eb. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: preallocate radix tree node for global readahead treeDavid Sterba
We can preallocate the node so insertion does not have to do that under the lock. The GFP flags for the global radix tree are initialized to GFP_NOFS & ~__GFP_DIRECT_RECLAIM but we can use GFP_KERNEL, because readahead is optional and not on any critical writeout path. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: preallocate radix tree node for readaheadDavid Sterba
We can preallocate the node so insertion does not have to do that under the lock. The GFP flags for the per-device radix tree are initialized to GFP_NOFS & ~__GFP_DIRECT_RECLAIM but we can use GFP_KERNEL, same as an allocation above anyway, but also because readahead is optional and not on any critical writeout path. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: No need to check !(flags & MS_RDONLY) twiceGoldwyn Rodrigues
Code cleanup. The code block is for !(*flags & MS_RDONLY). We don't need to check it again. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: update comments in cache_save_setupLiu Bo
We also don't bother to flush free space cache while with free space tree. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: convert BUG_ON to WARN_ONLiu Bo
These two BUG_ON()s would never be true, ensured by callers' logic. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: helper for ops that requires full stripeLiu Bo
This adds a helper to show directly whether ops require full stripe. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: do not add extra mirror when dev_replace target dev is not availableLiu Bo
With this, we can avoid allocating memory for dev replace copies if the target dev is not available. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: handle operations for device replace separatelyLiu Bo
Since this part is mostly independent, this moves it to a separate function. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: introduce a function to get extra mirror from replaceLiu Bo
As the part of getting extra mirror in __btrfs_map_block is independent, this puts it into a separate function. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: separate DISCARD from __btrfs_map_blockLiu Bo
Since DISCARD is not as important as an operation like write, we don't copy it to target device during replace, and it makes __btrfs_map_block less complex. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: create a helper for getting chunk mapLiu Bo
We have similar code here and there, this merges them into a helper. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: add file item tracepointsLiu Bo
While debugging truncate problems, I found that these tracepoints could help us quickly know what went wrong. Two sets of tracepoints are created to track regular/prealloc file item and inline file item respectively, I put inline as a separate one since what inline file items cares about are way less than the regular one. This adds four tracepoints: - btrfs_get_extent_show_fi_regular - btrfs_get_extent_show_fi_inline - btrfs_truncate_show_fi_regular - btrfs_truncate_show_fi_inline Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ formatting adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_ctx.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_parity.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_block.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert scrub_recover.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert extent_state.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_root.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_caching_control.count from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert extent_map.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: convert btrfs_bio.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18Btrfs: remove ASSERT in btrfs_truncate_inode_itemsLiu Bo
After 76b42abbf748 ("Btrfs: fix data loss after truncate when using the no-holes feature"), For either NO_HOLES or inline extents, we've set last_size to newsize to avoid data loss after remount or inode got evicted and read again, thus, we don't need this check anymore. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18btrfs: fix a bogus warning when converting only data or metadataAdam Borowski
If your filesystem has, eg, data:raid0 metadata:raid1, and you run "btrfs balance -dconvert=raid1", the meta.target field will be uninitialized. That's otherwise ok, as it's unused except for this warning. Thus, let's use the existing set of raid levels for the comparison. As a side effect, non-convert balances will now nag about data>metadata. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-14Merge branch 'for-linus-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Dave Sterba collected a few more fixes for the last rc. These aren't marked for stable, but I'm putting them in with a batch were testing/sending by hand for this release" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix potential use-after-free for cloned bio Btrfs: fix segmentation fault when doing dio read Btrfs: fix invalid dereference in btrfs_retry_endio btrfs: drop the nossd flag when remounting with -o ssd
2017-04-11Btrfs: fix potential use-after-free for cloned bioLiu Bo
KASAN reports that there is a use-after-free case of bio in btrfs_map_bio. If we need to submit IOs to several disks at a time, the original bio would get cloned and mapped to the destination disk, but we really should use the original bio instead of a cloned bio to do the sanity check because cloned bios are likely to be freed by its endio. Reported-by: Diego <diegocg@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11Btrfs: fix segmentation fault when doing dio readLiu Bo
Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks") introduced this bug during iterating bio pages in dio read's endio hook, and it could end up with segment fault of the dio reading task. So the reason is 'if (nr_sectors--)', and it makes the code assume that there is one more block in the same page, so page offset is increased and the bio which is created to repair the bad block then has an incorrect bvec.bv_offset, and a later access of the page content would throw a segmentation fault. This also adds ASSERT to check page offset against page size. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11Btrfs: fix invalid dereference in btrfs_retry_endioLiu Bo
When doing directIO repair, we have this oops: [ 1458.532816] general protection fault: 0000 [#1] SMP ... [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs] [ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000 [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs] ... [ 1458.543261] Call Trace: [ 1458.543958] ? rcu_read_lock_sched_held+0xc4/0xd0 [ 1458.544374] bio_endio+0xed/0x100 [ 1458.544750] end_workqueue_fn+0x3c/0x40 [btrfs] [ 1458.545257] normal_work_helper+0x9f/0x900 [btrfs] [ 1458.545762] btrfs_endio_repair_helper+0x12/0x20 [btrfs] [ 1458.546224] process_one_work+0x34d/0xb70 [ 1458.546570] ? process_one_work+0x29e/0xb70 [ 1458.546938] worker_thread+0x1cf/0x960 [ 1458.547263] ? process_one_work+0xb70/0xb70 [ 1458.547624] kthread+0x17d/0x180 [ 1458.547909] ? kthread_create_on_node+0x70/0x70 [ 1458.548300] ret_from_fork+0x31/0x40 It turns out that btrfs_retry_endio is trying to get inode from a directIO page. This fixes the problem by using the saved inode pointer, done->inode. btrfs_retry_endio_nocsum has the same problem, and it's fixed as well. Also cleanup unused @start (which is too trivial for a separate patch). Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11btrfs: drop the nossd flag when remounting with -o ssdAdam Borowski
The opposite case was already handled right in the very next switch entry. And also when turning on nossd, drop ssd_spread. Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-31Merge branch 'for-linus-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have three small fixes queued up in my for-linus-4.11 branch" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix an integer overflow check btrfs: Change qgroup_meta_rsv to 64bit Btrfs: bring back repair during read
2017-03-29Btrfs: fix an integer overflow checkDan Carpenter
This isn't super serious because you need CAP_ADMIN to run this code. I added this integer overflow check last year but apparently I am rubbish at writing integer overflow checks... There are two issues. First, access_ok() works on unsigned long type and not u64 so on 32 bit systems the access_ok() could be checking a truncated size. The other issue is that we should be using a stricter limit so we don't overflow the kzalloc() setting ctx->clone_roots later in the function after the access_ok(): alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1); sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN); Fixes: f5ecec3ce21f ("btrfs: send: silence an integer overflow warning") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29btrfs: Change qgroup_meta_rsv to 64bitGoldwyn Rodrigues
Using an int value is causing qg->reserved to become negative and exclusive -EDQUOT to be reached prematurely. This affects exclusive qgroups only. TEST CASE: DEVICE=/dev/vdb MOUNTPOINT=/mnt SUBVOL=$MOUNTPOINT/tmp umount $SUBVOL umount $MOUNTPOINT mkfs.btrfs -f $DEVICE mount /dev/vdb $MOUNTPOINT btrfs quota enable $MOUNTPOINT btrfs subvol create $SUBVOL umount $MOUNTPOINT mount /dev/vdb $MOUNTPOINT mount -o subvol=tmp $DEVICE $SUBVOL btrfs qgroup limit -e 3G $SUBVOL btrfs quota rescan /mnt -w for i in `seq 1 44000`; do dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1 if [[ $? > 0 ]]; then btrfs qgroup show -pcref $SUBVOL exit 1 fi done Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> [ add reproducer to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29Btrfs: bring back repair during readLiu Bo
Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and drop checks") made a cleanup around readpage_io_failed_hook, and it was supposed to keep the original sematics, but it also unexpectedly disabled repair during read for dup, raid1 and raid10. This fixes the problem by letting data's inode call the generic readpage_io_failed callback by returning -EAGAIN from its readpage_io_failed_hook in order to notify end_bio_extent_readpage to do the rest. We don't call it directly because the generic one takes an offset from end_bio_extent_readpage() to calculate the index in the checksum array and inode's readpage_io_failed_hook doesn't offer that offset. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ keep the const function attribute ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-23Merge branch 'for-linus-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Zygo tracked down a very old bug with inline compressed extents. I didn't tag this one for stable because I want to do individual tested backports. It's a little tricky and I'd rather do some extra testing on it along the way" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: add missing memset while reading compressed inline extents Btrfs: fix regression in lock_delalloc_pages btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
2017-03-17btrfs: add missing memset while reading compressed inline extentsZygo Blaxell
This is a story about 4 distinct (and very old) btrfs bugs. Commit c8b978188c ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3). Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole. Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed. There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to. This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters. The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file. A fast reproducer for bug #0 is presented below. A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read. Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure. It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak. So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption found in the wild: item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 inode generation 50 transid 50 size 47424 nbytes 49141 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none) item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 inode ref index 3 namelen 10 name: DB_File.so item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 inline extent data size 1341 ram 4085 compress(zlib) item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 extent data disk byte 5367308288 nr 20480 extent data offset 0 nr 45056 ram 45056 extent compression(zlib) Different data appears in userspace during each read of the 11 bytes between 4085 and 4096. The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero. Here is a reproducer from Liu Bo, which demonstrates another method of creating the same inline extent and hole pattern: Using 'page_poison=on' kernel command line (or enable CONFIG_PAGE_POISONING) run the following: # touch foo # chattr +c foo # xfs_io -f -c "pwrite -W 0 1000" foo # xfs_io -f -c "falloc 4 8188" foo # od -x foo # echo 3 >/proc/sys/vm/drop_caches # od -x foo This produce the following on my box: Correct output: file contains 1000 data bytes followed by zeros: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000 0001760 0000 0000 0000 0000 0000 0000 0000 0000 * 0020000 Actual output: the data after the first 1000 bytes will be different each run: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d 0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400 0002000 435f 0056 5f74 6164 7400 645f 0062 5f74 (...) Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-03-17Btrfs: fix regression in lock_delalloc_pagesLiu Bo
The bug is a regression after commit (da2c7009f6ca "btrfs: teach __process_pages_contig about PAGE_LOCK operation") and commit (76c0021db8fd "Btrfs: use helper to simplify lock/unlock pages"). So if the dirty pages which are under writeback got truncated partially before we lock the dirty pages, we couldn't find all pages mapping to the delalloc range, and the bug didn't return an error so it kept going on and found that the delalloc range got truncated and got to unlock the dirty pages, and then the ASSERT could caught the error, and showed ----------------------------------------------------------------------------- assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716 ----------------------------------------------------------------------------- This fixes the bug by returning the proper -EAGAIN. Cc: David Sterba <dsterba@suse.com> Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-03Merge branch 'rebased-statx' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs 'statx()' update from Al Viro. This adds the new extended stat() interface that internally subsumes our previous stat interfaces, and allows user mode to specify in more detail what kind of information it wants. It also allows for some explicit synchronization information to be passed to the filesystem, which can be relevant for network filesystems: is the cached value ok, or do you need open/close consistency, or what? From David Howells. Andreas Dilger points out that the first version of the extended statx interface was posted June 29, 2010: https://www.spinics.net/lists/linux-fsdevel/msg33831.html * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: statx: Add a system call to make enhanced file info available