summaryrefslogtreecommitdiff
path: root/fs/btrfs/scrub.c
AgeCommit message (Collapse)Author
2016-12-16Merge branch 'for-linus-4.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "Jeff Mahoney and Dave Sterba have a really nice set of cleanups in here, and Christoph pitched in corrections/improvements to make btrfs use proper helpers for bio walking instead of doing it by hand. There are some key fixes as well, including some long standing bugs that took forever to track down in btrfs_drop_extents and during balance" * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits) btrfs: limit async_work allocation and worker func duration Revert "Btrfs: adjust len of writes if following a preallocated extent" Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors btrfs: opencode chunk locking, remove helpers btrfs: remove root parameter from transaction commit/end routines btrfs: split btrfs_wait_marked_extents into normal and tree log functions btrfs: take an fs_info directly when the root is not used otherwise btrfs: simplify btrfs_wait_cache_io prototype btrfs: convert extent-tree tracepoints to use fs_info btrfs: root->fs_info cleanup, access fs_info->delayed_root directly btrfs: root->fs_info cleanup, add fs_info convenience variables btrfs: root->fs_info cleanup, update_block_group{,flags} btrfs: root->fs_info cleanup, lock/unlock_chunks btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size btrfs: pull node/sector/stripe sizes out of root and into fs_info btrfs: root->fs_info cleanup, io_ctl_init btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere btrfs: struct reada_control.root -> reada_control.fs_info btrfs: struct btrfsic_state->root should be an fs_info btrfs: alloc_reserved_file_extent trace point should use extent_root ...
2016-12-06btrfs: remove root parameter from transaction commit/end routinesJeff Mahoney
Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06btrfs: take an fs_info directly when the root is not used otherwiseJeff Mahoney
There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06btrfs: root->fs_info cleanup, add fs_info convenience variablesJeff Mahoney
In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06btrfs: pull node/sector/stripe sizes out of root and into fs_infoJeff Mahoney
We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06btrfs: root->fs_info cleanup, use fs_info->dev_root everywhereJeff Mahoney
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29btrfs: don't abuse REQ_OP_* flags for btrfs_map_blockChristoph Hellwig
btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays it's own flag over the block layer flags. This is just asking for a conflict, so introduce a different map flags enum inside of btrfs instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-01block,fs: use REQ_* flags directlyChristoph Hellwig
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-26btrfs: unsplit printed stringsJeff Mahoney
CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-08-04Merge branch 'for-linus-4.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull more btrfs updates from Chris Mason: "This is part two of my btrfs pull, which is some cleanups and a batch of fixes. Most of the code here is from Jeff Mahoney, making the pointers we pass around internally more consistent and less confusing overall. I noticed a small problem right before I sent this out yesterday, so I fixed it up and re-tested overnight" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits) Btrfs: fix __MAX_CSUM_ITEMS btrfs: btrfs_abort_transaction, drop root parameter btrfs: add btrfs_trans_handle->fs_info pointer btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction btrfs: convert nodesize macros to static inlines btrfs: introduce BTRFS_MAX_ITEM_SIZE btrfs: cleanup, remove prototype for btrfs_find_root_ref btrfs: copy_to_sk drop unused root parameter btrfs: simpilify btrfs_subvol_inherit_props btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root btrfs: tests, require fs_info for root btrfs: tests, move initialization into tests/ btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs: prefix fsid to all trace events btrfs: plumb fs_info into btrfs_work btrfs: remove obsolete part of comment in statfs btrfs: hide test-only member under ifdef btrfs: Ratelimit "no csum found" info message btrfs: Add ratelimit to btrfs printing Btrfs: fix unexpected balance crash due to BUG_ON ...
2016-07-26btrfs: plumb fs_info into btrfs_workJeff Mahoney
In order to provide an fsid for trace events, we'll need a btrfs_fs_info pointer. The most lightweight way to do that for btrfs_work structures is to associate it with the __btrfs_workqueue structure. Each queued btrfs_work structure has a workqueue associated with it, so that's a natural fit. It's a privately defined structures, so we add accessors to retrieve the fs_info pointer. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26Btrfs: subpage-blocksize: Rate limit scrub error messageChandan Rajendra
btrfs/073 invokes scrub ioctl in a tight loop. In subpage-blocksize scenario this results in a lot of "scrub: size assumption sectorsize != PAGE_SIZE " messages being printed on the console. To reduce the number of such messages this commit uses btrfs_err_rl() instead of btrfs_err(). Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-07btrfs: use bio op accessorsMike Christie
This should be the easier cases to convert btrfs to bio_set_op_attrs/bio_op. They are mostly just cut and replace type of changes. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-30Btrfs: fix race setting block group back to RW mode during device replaceFilipe Manana
After it finishes processing a device extent, the device replace code sets back the block group to RW mode and then after that it sets the left cursor to match the logical end address of the block group, so that future writes into extents belonging to the block group go both the source (old) and target (new) devices. However from the moment we turn the block group back to RW mode we have a short time window, that lasts until we update the left cursor's value, where extents can be allocated from the block group and written to, in which case they will not be copied/written to the target (new) device. Fix this by updating the left cursor's value before turning the block group back to RW mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30Btrfs: fix unprotected assignment of the left cursor for device replaceFilipe Manana
We were assigning new values to fields of the device replace object without holding the respective lock after processing each device extent. This is important for the left cursor field which can be accessed by a concurrent task running __btrfs_map_block (which, correctly, takes the device replace lock). So change these fields while holding the device replace lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30Btrfs: fix race setting block group readonly during device replaceFilipe Manana
When we do a device replace, for each device extent we find from the source device, we set the corresponding block group to readonly mode to prevent writes into it from happening while we are copying the device extent from the source to the target device. However just before we set the block group to readonly mode some concurrent task might have already allocated an extent from it or decided it could perform a nocow write into one of its extents, which can make the device replace process to miss copying an extent since it uses the extent tree's commit root to search for extents and only once it finishes searching for all extents belonging to the block group it does set the left cursor to the logical end address of the block group - this is a problem if the respective ordered extents finish while we are searching for extents using the extent tree's commit root and no transaction commit happens while we are iterating the tree, since it's the delayed references created by the ordered extents (when they complete) that insert the extent items into the extent tree (using the non-commit root of course). Example: CPU 1 CPU 2 btrfs_dev_replace_start() btrfs_scrub_dev() scrub_enumerate_chunks() --> finds device extent belonging to block group X <transaction N starts> starts buffered write against some inode writepages is run against that inode forcing dellaloc to run btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() run_delalloc_range() cow_file_range() btrfs_reserve_extent() --> allocates an extent from block group X (which is not yet in RO mode) btrfs_add_ordered_extent() --> creates ordered extent Y flush_epd_write_bio() --> bio against the extent from block group X is submitted btrfs_inc_block_group_ro(bg X) --> sets block group X to readonly scrub_chunk(bg X) scrub_stripe(device extent from srcdev) --> keeps searching for extent items belonging to the block group using the extent tree's commit root --> it never blocks due to fs_info->scrub_pause_req as no one tries to commit transaction N --> copies all extents found from the source device into the target device --> finishes search loop bio completes ordered extent Y completes and creates delayed data reference which will add an extent item to the extent tree when run (typically at transaction commit time) --> so the task doing the scrub/device replace at CPU 1 misses this and does not copy this extent into the new/target device btrfs_dec_block_group_ro(bg X) --> turns block group X back to RW mode dev_replace->cursor_left is set to the logical end offset of block group X So fix this by waiting for all cow and nocow writes after setting a block group to readonly mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-25Merge branch 'cleanups-4.7' into for-chris-4.7-20160525David Sterba
2016-05-25btrfs: fix string and comment grammatical issues and typosNicholas D Steeves
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25btrfs: scrub: Set bbio to NULL before calling btrfs_map_blockZhao Lei
We usually call btrfs_put_bbio() when btrfs_map_block() failed, btrfs_put_bbio() works right whether bbio is a valid value, or NULL. But there is a exception, in some case, btrfs_map_block() will return fail without touching *bbio(keeping its original value), and if bbio was not initialized yet, invalid memory accessing will happened. Above case is in scrub_missing_raid56_pages(), and similar case in scrub_raid56_parity(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-16btrfs: fix memory leak during RAID 5/6 device replacementScott Talbert
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never freed anywhere. Signed-off-by: Scott Talbert <scott.talbert@hgst.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()Ashish Samant
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK. page_index should be checked with the same to trigger BUG_ON(). Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06Btrfs: fix divide error upon chunk's stripe_lenLiu Bo
The struct 'map_lookup' uses type int for @stripe_len, while btrfs_chunk_stripe_len() can return a u64 value, and it may end up with @stripe_len being undefined value and it can lead to 'divide error' in __btrfs_map_block(). This changes 'map_lookup' to use type u64 for stripe_len, also right now we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for BTRFS_STRIPE_LEN. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ folded division fix to scrub_raid56_parity ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: reuse existing variable in scrub_stripe, reduce stack usageDavid Sterba
The key variable occupies 17 bytes, the key_start is used once, we can simply reuse existing 'key' for that purpose. As the key is not a simple type, compiler doest not do it on itself. Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29btrfs: sink gfp parameter to clear_extent_bitsDavid Sterba
Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around. Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29btrfs: sink gfp parameter to set_extent_bitsDavid Sterba
All callers pass GFP_NOFS. Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-11btrfs: scrub: silence an uninitialized variable warningDan Carpenter
It's basically harmless if "ref_level" isn't initialized since it's only used for an error message, but it causes a static checker warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-26Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6David Sterba
2016-02-23Btrfs: fix lockdep deadlock warning due to dev_replaceLiu Bo
Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] *** DEADLOCK *** Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11btrfs: scrub: use GFP_KERNEL on the submission pathDavid Sterba
Scrub is not on the critical writeback path we don't need to use GFP_NOFS for all allocations. The failures are handled and stats passed back to userspace. Let's use GFP_KERNEL on the paths where everything is ok, ie. setup the global structures and the IO submission paths. Functions that do the repair and fixups still use GFP_NOFS as we might want to skip any other filesystem activity if we encounter an error. This could turn out to be unnecessary, but requires more review compared to the easy cases in this patch. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-22wrappers for ->i_mutex accessAl Viro
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-20btrfs: Fix calculation of rbio->dbitmap's size calculationZhao Lei
Current code is trying to calculate rbio->dbitmap's size to make it align to sizeof(long), but implement haven't achived this object, it is align to sizeof(char) instead. This patch fixed above calculation, and use sizeof(long) instead of fixed "8" to increate compatibility. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19Merge branch 'misc-cleanups-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-15btrfs: cleanup, stop casting for extent_map->lookup everywhereJeff Mahoney
Overloading extent_map->bdev to struct map_lookup * might have started out as a means to an end, but it's a pattern that's used all over the place now. Let's get rid of the casting and just add a union instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-11Merge branch 'misc-cleanups-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11Merge branch 'misc-for-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-07btrfs: cleanup, use enum values for btrfs_path readaDavid Sterba
Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: cleanup, remove stray return statementsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: drop duplicate prefix from scrub workqueuesDavid Sterba
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix. Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-23Merge branch 'dev/simplify-set-bit' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2015-12-03btrfs: drop unused parameter from lock_extent_bitsDavid Sterba
We've always passed 0. Stack usage will slightly decrease. Signed-off-by: David Sterba <dsterba@suse.com>
2015-11-25Btrfs: fix scrub preventing unused block groups from being deletedFilipe Manana
Currently scrub can race with the cleaner kthread when the later attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting later the block group - unless the block group becomes used and unused again. The following diagram illustrates that race: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs and removes it from that list scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO sees the block group is already RO and therefore doesn't delete it nor adds it back to unused list So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finished scrubbing it and it hasn't been removed already. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix race between scrub and block group deletionFilipe Manana
Scrub can race with the cleaner kthread deleting block groups that are unused (and with relocation too) leading to a failure with error -EINVAL that gets returned to user space. The following diagram illustrates how it happens: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs sets block group to RO btrfs_remove_chunk(bg X) deletes device extents scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) btrfs_remove_block_group(bg X) deletes block group from fs_info->block_group_cache_tree removes extent map from fs_info->mapping_tree scrub_chunk(offset X) searches fs_info->mapping_tree for extent map starting at offset X --> doesn't find any such extent map --> returns -EINVAL and scrub errors out to userspace with -EINVAL Fix this by dealing with an extent map lookup failure as an indicator of block group deletion. Issue reproduced with fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: Continue replace when set_block_ro failedZhaolei
xfstests/011 failed in node with small_size filesystem. Can be reproduced by following script: DEV_LIST="/dev/vdd /dev/vde" DEV_REPLACE="/dev/vdf" do_test() { local mkfs_opt="$1" local size="$2" dmesg -c >/dev/null umount $SCRATCH_MNT &>/dev/null echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}" mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1 mount "${DEV_LIST[0]}" $SCRATCH_MNT echo -n "Writing big files" dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1 for ((i = 1; i <= size; i++)); do echo -n . /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1 done echo echo Start replace btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || { dmesg return 1 } return 0 } # Set size to value near fs size # for example, 1897 can trigger this bug in 2.6G device. # ./do_test "-d raid1 -m raid1" 1897 System will report replace fail with following warning in dmesg: [ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started [ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28 [ 135.543505] ------------[ cut here ]------------ [ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440() [ 135.545276] Modules linked in: [ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256 [ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 [ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000 [ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4 [ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70 [ 135.550758] Call Trace: [ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55 [ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0 [ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20 [ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440 [ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0 [ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0 [ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580 [ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70 [ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90 [ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80 [ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]--- Reason: When big data writen to fs, the whole free space will be allocated for data chunk. And operation as scrub need to set_block_ro(), and when there is only one metadata chunk in system(or other metadata chunks are all full), the function will try to allocate a new chunk, and failed because no space in device. Fix: When set_block_ro failed for metadata chunk, it is not a problem because scrub_lock paused commit_trancaction in same time, and metadata are always cowed, so the on-the-fly writepages will not write data into same place with scrub/replace. Let replace continue in this case is no problem. Tested by above script, and xfstests/011, plus 100 times xfstests/070. Changelog v1->v2: 1: Add detail comments in source and commit-message. 2: Add dmesg detail into commit-message. 3: Limit return value of -ENOSPC to be passed. All suggested by: Filipe Manana <fdmanana@gmail.com> Suggested-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Remove len argument from scrub_find_csumZhao Lei
It is useless. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Reduce unnecessary arguments in scrub_recheck_blockZhao Lei
We don't need pass so many arguments for recheck sblock now, this patch cleans them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for ↵Zhao Lei
scrub_recheck_block_checksum We can use existing scrub_checksum_data() and scrub_checksum_tree_block() for scrub_recheck_block_checksum(), instead of write duplicated code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksumZhao Lei
We should reset sblock->xxx_error stats before calling scrub_recheck_block_checksum(). Current code run correctly because all sblock are allocated by k[cz]alloc(), and the error stats are not got changed. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: scrub: setup all fields for sblock_to_checkZhao Lei
scrub_setup_recheck_block() isn't setup all necessary fields for sblock_to_check because history reason. So current code need more arguments in severial functions, and more local variables, just to passing these lacked values to necessary place. This patch setup above fields to sblock_to_check in scrub_setup_recheck_block(), for: 1: more cleanup for function arg, local variable 2: to make sblock_to_check complete, then we can use sblock_to_check without concern about some uninitialized member. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>