path: root/fs/btrfs
Age  Commit message  Author
2020-05-25btrfs: only check priority tickets for priority flushingJosef Bacik
In debugging a generic/320 failure on ppc64, Nikolay noticed that sometimes we'd ENOSPC out with plenty of space to reclaim if we had committed the transaction. He further discovered that this was because there was a priority ticket that was small enough to fit in the free space currently in the space_info. Consider the following scenario. There is no more space to reclaim in the fs without committing the transaction. Assume there's 1MiB of space free in the space info, but there are pending normal tickets with 2MiB reservations. Now a priority ticket comes in with a .5MiB reservation. Because we have normal tickets pending we add ourselves to the priority list, despite the fact that we could satisfy this reservation. The flushing machinery now gets to the point where it wants to commit the transaction, but because there's a .5MiB ticket on the priority list and we have 1MiB of free space we assume the ticket will be granted soon, so we bail without committing the transaction. Meanwhile the priority flushing does not commit the transaction, and eventually fails with an ENOSPC. Then all other tickets are failed with ENOSPC because we were never able to actually commit the transaction. The fix for this is we should have simply granted the priority flusher his reservation, because there was space to make the reservation. Priority flushers by definition take priority, so they are allowed to make their reservations before any previous normal tickets. By not adding this priority ticket to the list the normal flushing mechanisms will then commit the transaction and everything will continue normally. We still need to serialize ourselves with other priority tickets, so if there are any tickets on the priority list then we need to add ourselves to that list in order to maintain the serialization between priority tickets. 
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
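The queueing decision described above can be sketched in user-space C. This is only an illustration of the rule, not the kernel code: `struct space_state` and both helper names are hypothetical, and the real space_info tracks considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a space_info; not the kernel structure. */
struct space_state {
	uint64_t free_bytes;           /* space that can be reserved right now */
	unsigned int normal_tickets;   /* normal tickets already waiting */
	unsigned int priority_tickets; /* priority tickets already waiting */
};

/* Old behaviour: any pending ticket forces a new priority ticket to queue. */
static inline bool priority_must_queue_old(const struct space_state *s,
					   uint64_t bytes)
{
	if (s->normal_tickets > 0 || s->priority_tickets > 0)
		return true;
	return bytes > s->free_bytes;
}

/*
 * Fixed behaviour: only pending priority tickets (to keep priority tickets
 * serialized) or a genuine lack of space force the ticket onto the list.
 */
static inline bool priority_must_queue_new(const struct space_state *s,
					   uint64_t bytes)
{
	if (s->priority_tickets > 0)
		return true;
	return bytes > s->free_bytes;
}
```

With 1MiB free, one pending normal ticket and a .5MiB priority request, the old rule queues the ticket while the new rule grants it immediately, which matches the scenario in the commit message.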
2020-05-25btrfs: account for trans_block_rsv in may_commit_transactionJosef Bacik
On ppc64le with 64k page size (respectively 64k block size) generic/320 was failing and debug output showed we were getting a premature ENOSPC with a bunch of space in btrfs_fs_info::trans_block_rsv. This meant there were still open transaction handles holding space, yet the flusher didn't commit the transaction because it deemed the freed space won't be enough to satisfy the current reserve ticket. Fix this by accounting for space in trans_block_rsv when deciding whether the current transaction should be committed or not. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
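A minimal sketch of the accounting change, assuming a much simplified model: `struct reclaim_estimate` and `commit_would_help()` are hypothetical names, and the kernel's may_commit_transaction() weighs other reservations that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what a transaction commit could free. */
struct reclaim_estimate {
	uint64_t pinned_bytes;    /* pinned extents released by a commit */
	uint64_t trans_rsv_bytes; /* space held in trans_block_rsv by open handles */
};

/*
 * Decide whether committing could satisfy a ticket of @ticket_bytes.
 * @count_trans_rsv selects the fixed behaviour, where space still held
 * by open transaction handles is counted as reclaimable.
 */
static inline bool commit_would_help(const struct reclaim_estimate *e,
				     uint64_t ticket_bytes,
				     bool count_trans_rsv)
{
	uint64_t reclaimable = e->pinned_bytes;

	if (count_trans_rsv)
		reclaimable += e->trans_rsv_bytes;
	return reclaimable >= ticket_bytes;
}
```

In this model a ticket of 10 units with 4 units pinned and 8 units in the transaction reserve is refused a commit under the old accounting and allowed one under the new accounting.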
2020-05-25btrfs: allow to use up to 90% of the global block rsv for unlinkJosef Bacik
We previously had a limit of stealing 50% of the global reserve for unlink. This was from a time when the global reserve was used for the delayed refs as well. However now those reservations are kept separate, so the global reserve can be depleted much more to allow us to make progress for space restoring operations like unlink. Change the minimum amount of space required to be left in the global reserve to 10%. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
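The effect of moving the floor from 50% to 10% can be shown with a small calculation. The helper below is a hypothetical sketch, not the kernel function: it assumes a steal is allowed only if the global reserve still holds at least `min_percent` of its size afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: may @bytes be taken from a global reserve of
 * @size bytes that currently has @reserved bytes in it, while leaving
 * at least @min_percent of @size behind?
 * (size * min_percent is assumed not to overflow in this sketch.)
 */
static inline bool can_steal_from_global_rsv(uint64_t size, uint64_t reserved,
					     uint64_t bytes,
					     unsigned int min_percent)
{
	uint64_t floor = size * min_percent / 100;

	return reserved >= floor + bytes;
}
```

For a reserve of size 100 holding 60, a request for 30 fails with a 50% floor (60 < 50 + 30) but succeeds with a 10% floor (60 >= 10 + 30).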
2020-05-25btrfs: improve global reserve stealing logicJosef Bacik
For unlink transactions and block group removal btrfs_start_transaction_fallback_global_rsv will first try to start an ordinary transaction and if it fails it will fall back to reserving the required amount by stealing from the global reserve. This is problematic for the same reason as previous iterations of the ENOSPC handling: thundering herd. We get a bunch of failures all at once, everybody tries to allocate from the global reserve, some win and some lose, and we get an ENOSPC. Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL, which is used to mark unlink reservations. To fix this we need to integrate this logic into the normal ENOSPC infrastructure. We still go through all of the normal flushing work, and at the moment we begin to fail all the tickets we try to satisfy any tickets that are allowed to steal by stealing from the global reserve. If this works we start the flushing system over again just like we would with a normal ticket satisfaction. This serializes our global reserve stealing, so we don't have the thundering herd problem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: distinguish reloc and non-reloc use of indirect resolutionQu Wenruo
For relocation tree detection, relocation backref cache uses btrfs_should_ignore_reloc_root() which uses relocation-specific checks like checking the DEAD_RELOC_ROOT bit. However for general purpose backref cache, we can't rely on that check, as it's possible that relocation is also running. For general purpose backref cache, we detect reloc root by SHARED_BLOCK_REF item. Only reloc root node has its parent bytenr pointing back to itself. And in that case, backref cache will mark the reloc root node useless, dropping any child orphan nodes. So only call btrfs_should_ignore_reloc_root() if the backref cache is for relocation. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: move error handling of build_backref_tree() to backref.cQu Wenruo
The error cleanup will be extracted as a new function, btrfs_backref_error_cleanup(), and moved to backref.c and exported for later usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move finish_upper_links()Qu Wenruo
This is the 2nd major part of the generic backref cache. Move it to backref.c so we can reuse it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move handle_one_tree_block()Qu Wenruo
This function is the major part of backref cache build process, move it to backref.c so we can reuse it later. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: open code read_fs_root() for handle_indirect_tree_backref()Qu Wenruo
The backref code is going to be moved to backref.c, and read_fs_root() is just a simple wrapper, so open-code it to prepare for the incoming code move. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move should_ignore_root()Qu Wenruo
This function is mostly specific to the relocation backref cache, but since we're moving the main part of backref cache to backref.c, we need to export it. To avoid confusion, rename the function to btrfs_should_ignore_reloc_root() to make the name a little clearer. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move backref_tree_panic()Qu Wenruo
Also change the parameter, since all callers can easily grab an fs_info, there is no need for all the pointer chasing. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move backref_cache_cleanup()Qu Wenruo
Since we're releasing all existing nodes/edges, rather than cleaning up the mess after an error, "release" is a more proper name here. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move remove_backref_node()Qu Wenruo
Also add a comment explaining the cleanup process, to distinguish it from btrfs_backref_drop_node(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move drop_backref_node()Qu Wenruo
With extra comment for drop_backref_node() as it has some similarity with remove_backref_node(), thus we need extra comment explaining the difference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move free_backref_(node|edge)Qu Wenruo
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move link_backref_edge()Qu Wenruo
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move alloc_backref_edge()Qu Wenruo
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move alloc_backref_node()Qu Wenruo
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: rename and move backref_cache_init()Qu Wenruo
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: rename tree_entry to rb_simple_node and export itQu Wenruo
Structure tree_entry provides a very simple rb_tree which only uses bytenr as search index. That tree_entry is used in 3 structures: backref_node, mapping_node and tree_block. Since we're going to make backref_node independent from relocation, it's a good time to extract the tree_entry into rb_simple_node, and export it into misc.h. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: move btrfs_backref_(node|edge|cache) structures to backref.hQu Wenruo
These 3 structures are the main part of btrfs backref cache, move them to backref.h to build the basis for later reuse. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: add btrfs_ prefix for backref_node/edge/cacheQu Wenruo
Those three structures are the main elements of backref cache. Add the "btrfs_" prefix for later export. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: refactor useless nodes handling into its own functionQu Wenruo
This patch will also add some comment for the cleanup. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: refactor finishing part of upper linkage into finish_upper_links()Qu Wenruo
After handle_one_tree_backref(), all newly added (not cached) edges and nodes have the following features:
- Only backref_edge::list[LOWER] is linked. This means we can only iterate from bottom to top, not the other direction.
- Newly added nodes are not added to the cache rb_tree yet.
So to finish the backref cache, we still need to finish the links and add all nodes into the backref cache rb_tree. This patch will refactor the existing code into finish_upper_links(), and add more comments on each branch and on why we need to do all the work. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: remove the open-coded goto loop for breadth-first searchQu Wenruo
build_backref_tree() uses "goto again;" to implement a breadth-first search to build backref cache. This patch will extract most of its work into a wrapper, handle_one_tree_block(), and use a do {} while() loop to implement the same thing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: pass essential members for alloc_backref_node()Qu Wenruo
Bytenr and level are essential parameters for backref_node, thus it makes sense to initialize them at allocation time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: use wrapper to replace open-coded edge linkingQu Wenruo
Since backref_edge is used to connect upper and lower backref nodes, and needs to access both nodes, some code can look pretty nasty: list_add_tail(&edge->list[LOWER], &cur->upper); The above code will link @cur to the LOWER side of the edge, while both "LOWER" and "upper" words show up. This can sometimes be very confusing for reader to grasp. This patch introduces a new wrapper, link_backref_edge(), to handle the linking behavior. Which also has extra ASSERT() to ensure caller won't pass wrong nodes. Also, this updates the comment of related lists of backref_node and backref_edge, to make it more clear that each list points to what. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
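A wrapper of this kind might look roughly like the following user-space sketch. Everything here is a simplified stand-in: the list helpers imitate the kernel's list_head, `link_edge()` is a hypothetical name, it always links both sides, and the level check in the assertion is an assumption about what "wrong nodes" means.

```c
#include <assert.h>
#include <stddef.h>

enum { LOWER = 0, UPPER = 1 };

/* Minimal circular doubly linked list, imitating the kernel's list_head. */
struct list_head {
	struct list_head *prev, *next;
};

static inline void list_init(struct list_head *head)
{
	head->prev = head->next = head;
}

static inline void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

struct node {
	int level;
	struct list_head upper; /* edges leading to this node's parents */
	struct list_head lower; /* edges leading to this node's children */
};

struct edge {
	struct list_head list[2]; /* list[LOWER] sits in the lower node's upper list */
	struct node *node[2];
};

/* Hypothetical wrapper: callers name the two nodes instead of the list slots. */
static inline void link_edge(struct edge *e, struct node *lower, struct node *upper)
{
	assert(lower && upper && upper->level == lower->level + 1);
	e->node[LOWER] = lower;
	e->node[UPPER] = upper;
	list_add_tail(&e->list[LOWER], &lower->upper);
	list_add_tail(&e->list[UPPER], &upper->lower);
}
```

The point of the design is that the confusing pairing of `list[LOWER]` with `->upper` is written once, inside the helper, and the assertion catches swapped arguments early.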
2020-05-25btrfs: reloc: refactor indirect tree backref processing into its own functionQu Wenruo
The processing of indirect tree backref (TREE_BLOCK_REF) is the most complex work. We need to grab the fs root, do a tree search to locate all its parent nodes, link all needed edges, and put all uncached edges to pending edge list. This is definitely worth a helper function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: refactor direct tree backref processing into its own functionQu Wenruo
For BTRFS_SHARED_BLOCK_REF_KEY, its processing is straightforward, as we know the parent node bytenr directly. If the parent is already cached, or a root, call it a day. If the parent is not cached, add it to the pending list. This patch will just refactor this part into its own function, handle_direct_tree_backref(), and add some comments explaining the @ref_key parameter. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: make reloc root search-specific for relocation backref cacheQu Wenruo
find_reloc_root() searches reloc_control::reloc_root_tree to find the reloc root. This behavior is only useful for relocation backref cache. For the incoming more generic purpose backref cache, we don't care about who owns the reloc root, but only care if it's a reloc root. So this patch makes the following modifications to make the reloc root search more specific to relocation backref:
- Add backref_node::is_reloc_root
  This will be an extra indicator for generic purposed backref cache. User doesn't need to read root key from backref_node::root to determine if it's a reloc root. Also for reloc tree root, it's useless and will be queued to useless list.
- Add backref_cache::is_reloc
  This will allow backref cache code to do different behavior for generic purpose backref cache and relocation backref cache.
- Pass fs_info to find_reloc_root()
- Export find_reloc_root()
  So backref.c can utilize this function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: add backref_cache::fs_info memberQu Wenruo
Add this member so that we can grab fs_info without the help from reloc_control. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: add backref_cache::pending_edge and backref_cache::useless_nodeQu Wenruo
These two new members will act the same as the existing local lists, @useless and @list in build_backref_tree(). Currently build_backref_tree() is only executed serially, thus moving such local list into backref_cache is still safe. Also since we're here, use list_first_entry() to replace a lot of list_entry() calls after !list_empty(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: rename mark_block_processed and __mark_block_processedQu Wenruo
These two functions are weirdly named, mark_block_processed() in fact just marks a range dirty unconditionally, while __mark_block_processed() does extra check before doing the marking. This patch will open code old mark_block_processed, and rename __mark_block_processed() to remove the "__" prefix. Since we're here, also kill the forward declaration, which could also kill in_block_group() with in_range() macro. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: reloc: use btrfs_backref_iter infrastructureQu Wenruo
In the core function of relocation, build_backref_tree(), we need to iterate over all backref items of one tree block. Use the btrfs_backref_iter infrastructure to do the loop and make the code more readable. The backref item loop becomes much easier to read:

  ret = btrfs_backref_iter_start(iter, cur->bytenr);
  for (; ret == 0; ret = btrfs_backref_iter_next(iter)) {
          /* The really important work */
  }

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: implement btrfs_backref_iter_next()Qu Wenruo
This function will go to the next inline/keyed backref for btrfs_backref_iter infrastructure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: backref: introduce the skeleton of btrfs_backref_iterQu Wenruo
Due to the complex nature of the btrfs extent tree, iterating over all backrefs of one extent involves quite a lot of work, like searching for the EXTENT_ITEM/METADATA_ITEM and iterating through inline and keyed backrefs. Normally this would result in complex code, something like:

  btrfs_search_slot()
  /* Ensure we are at EXTENT_ITEM/METADATA_ITEM */
  while (1) {   /* Loop for extent tree items */
          while (ptr < end) { /* Loop for inlined items */
                  /* Real work here */
          }
  next:
          ret = btrfs_next_item()
          /* Ensure we're still at keyed item for specified bytenr */
  }

The idea of btrfs_backref_iter is to avoid such a complex and hard to read code structure, and instead allow something like the following:

  iter = btrfs_backref_iter_alloc();
  ret = btrfs_backref_iter_start(iter, bytenr);
  if (ret < 0)
          goto out;
  for (; ; ret = btrfs_backref_iter_next(iter)) {
          /* Real work here */
  }
  out:
  btrfs_backref_iter_free(iter);

This patch is just the skeleton + btrfs_backref_iter_start() code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
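The start/next loop shape can be tried out in user space with a toy iterator. The sketch below walks a plain array instead of extent tree items; `struct ref_iter` and its helpers are hypothetical, and the return convention (0 while positioned on an item, 1 once exhausted) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator over an in-memory array, standing in for backref items. */
struct ref_iter {
	const int *items;
	size_t count;
	size_t pos;
};

/* Returns 0 if positioned on the first item, 1 if there are no items. */
static inline int ref_iter_start(struct ref_iter *it, const int *items, size_t count)
{
	it->items = items;
	it->count = count;
	it->pos = 0;
	return count > 0 ? 0 : 1;
}

/* Returns 0 if advanced to another item, 1 when the items are exhausted. */
static inline int ref_iter_next(struct ref_iter *it)
{
	if (it->pos + 1 >= it->count)
		return 1;
	it->pos++;
	return 0;
}

/* The caller's loop stays flat: all positioning logic lives in the iterator. */
static inline int sum_refs(const int *items, size_t count)
{
	struct ref_iter it;
	int sum = 0;
	int ret;

	for (ret = ref_iter_start(&it, items, count); ret == 0;
	     ret = ref_iter_next(&it))
		sum += it.items[it.pos];
	return sum;
}
```

The nested "search, then walk inline items, then step to the next keyed item" logic from the first listing would be hidden behind the two helper calls in the same way.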
2020-05-25btrfs: add missing annotation for btrfs_tree_lock()Jules Irenge
Sparse reports a warning at btrfs_tree_lock() warning: context imbalance in btrfs_tree_lock() - wrong count at exit The root cause is the missing annotation at btrfs_tree_lock() Add the missing __acquires(&eb->lock) annotation Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: add missing annotation for btrfs_lock_cluster()Jules Irenge
Sparse reports a warning at btrfs_lock_cluster() warning: context imbalance in btrfs_lock_cluster() - wrong count The root cause is the missing annotation at btrfs_lock_cluster() Add the missing __acquires(&cluster->refill_lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-09btrfs_ioctl_send(): don't bother with access_ok()Al Viro
we do copy_from_user() on that range anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-30btrfs: fix gcc-4.8 build warning for struct initializerArnd Bergmann
Some older compilers like gcc-4.8 warn about mismatched curly braces in an initializer:

fs/btrfs/backref.c: In function 'is_shared_data_backref':
fs/btrfs/backref.c:394:9: error: missing braces around initializer [-Werror=missing-braces]
  struct prelim_ref target = {0};
         ^
fs/btrfs/backref.c:394:9: error: (near initialization for 'target.rbnode') [-Werror=missing-braces]

Use the GNU empty initializer extension to avoid this. Fixes: ed58f2e66e84 ("btrfs: backref, don't add refs from shared block when resolving normal backref") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
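The difference between the two initializer forms can be reproduced with a small stand-alone structure. `struct inner` and `struct outer` are hypothetical stand-ins for a structure whose first member is itself an aggregate, as target.rbnode is in the warning above. The empty `{}` form is a GNU extension (it was later standardized in C23), so this sketch assumes GCC or a compatible compiler.

```c
#include <assert.h>

/* Hypothetical layout: the first member is itself a struct. */
struct inner {
	int a;
	int b;
};

struct outer {
	struct inner first;
	int tail;
};

static inline struct outer make_zeroed(void)
{
	/*
	 * "= {0}" zeroes everything too, but older GCC with -Wmissing-braces
	 * complains because the 0 lands in a nested aggregate. The empty
	 * initializer has no element to mis-brace.
	 */
	struct outer target = {};

	return target;
}
```

Both forms zero every member; only the diagnostics differ.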
2020-04-27btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_infoQu Wenruo
[BUG] One run of btrfs/063 triggered the following lockdep warning:

  ============================================
  WARNING: possible recursive locking detected
  5.6.0-rc7-custom+ #48 Not tainted
  --------------------------------------------
  kworker/u24:0/7 is trying to acquire lock:
  ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

  but task is already holding lock:
  ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(sb_internal#2);
    lock(sb_internal#2);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  4 locks held by kworker/u24:0/7:
   #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
   #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
   #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
   #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]

  stack backtrace:
  CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  Call Trace:
   dump_stack+0xc2/0x11a
   __lock_acquire.cold+0xce/0x214
   lock_acquire+0xe6/0x210
   __sb_start_write+0x14e/0x290
   start_transaction+0x66c/0x890 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   find_free_extent+0x1504/0x1a50 [btrfs]
   btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
   btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
   btrfs_copy_root+0x213/0x580 [btrfs]
   create_reloc_root+0x3bd/0x470 [btrfs]
   btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
   record_root_in_trans+0x191/0x1d0 [btrfs]
   btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
   start_transaction+0x16e/0x890 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
   finish_ordered_fn+0x15/0x20 [btrfs]
   btrfs_work_helper+0x116/0x9a0 [btrfs]
   process_one_work+0x632/0xb80
   worker_thread+0x80/0x690
   kthread+0x1a3/0x1f0
   ret_from_fork+0x27/0x50

It's pretty hard to reproduce, only one hit so far.

[CAUSE] This is because we're calling btrfs_join_transaction() without re-using the current running one:

  btrfs_finish_ordered_io()
  |- btrfs_join_transaction()            <<< Call #1
     |- btrfs_record_root_in_trans()
        |- btrfs_reserve_extent()
           |- btrfs_join_transaction()   <<< Call #2

Normally such btrfs_join_transaction() call should re-use the existing one, without trying to re-start a transaction. But the problem is, in btrfs_join_transaction() call #1, we call btrfs_record_root_in_trans() before initializing current::journal_info. And in btrfs_join_transaction() call #2, we're relying on current::journal_info to avoid such deadlock.

[FIX] Call btrfs_record_root_in_trans() after we have initialized current::journal_info. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
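The ordering problem can be modelled in a few lines of user-space C. This is a toy model under stated assumptions, not the kernel logic: `journal_info` stands in for current::journal_info, `freeze_lock_taken` counts how often the sb_internal lock would be acquired, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct handle {
	int use_count;
};

static struct handle handle_storage;
static struct handle *journal_info; /* models current::journal_info */
static int freeze_lock_taken;       /* models sb_internal acquisitions */
static bool root_recorded;

static struct handle *join_transaction(bool record_after_init);

/* Recording a root may allocate a tree block, which joins the transaction again. */
static void record_root(bool record_after_init)
{
	if (root_recorded)
		return;
	root_recorded = true;
	(void)join_transaction(record_after_init);
}

static struct handle *join_transaction(bool record_after_init)
{
	if (journal_info) {
		/* Nested call: reuse the running handle, no second lock. */
		journal_info->use_count++;
		return journal_info;
	}
	freeze_lock_taken++; /* a second acquisition here is the deadlock */
	handle_storage.use_count = 1;
	if (!record_after_init)
		record_root(record_after_init); /* buggy order */
	journal_info = &handle_storage;
	if (record_after_init)
		record_root(record_after_init); /* fixed order */
	return journal_info;
}

static void reset_model(void)
{
	journal_info = NULL;
	freeze_lock_taken = 0;
	root_recorded = false;
}
```

Calling `join_transaction(false)` after `reset_model()` takes the modelled lock twice, which corresponds to the recursive acquisition lockdep reported; `join_transaction(true)` takes it once because the nested call finds `journal_info` already set.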
2020-04-27btrfs: fix partial loss of prealloc extent past i_size after fsyncFilipe Manana
When we have an inode with a prealloc extent that starts at an offset lower than the i_size and there is another prealloc extent that starts at an offset beyond i_size, we can end up losing part of the first prealloc extent (the part that starts at i_size) and have an implicit hole if we fsync the file and then have a power failure. Consider the following example with comments explaining how and why it happens.

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  # Create our test file with 2 consecutive prealloc extents, each with a
  # size of 128Kb, and covering the range from 0 to 256Kb, with a file
  # size of 0.
  $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
  $ xfs_io -c "falloc -k 128K 128K" /mnt/foo

  # Fsync the file to record both extents in the log tree.
  $ xfs_io -c "fsync" /mnt/foo

  # Now do a redundant extent allocation for the range from 0 to 64Kb.
  # This will merely increase the file size from 0 to 64Kb. Instead we
  # could also do a truncate to set the file size to 64Kb.
  $ xfs_io -c "falloc 0 64K" /mnt/foo

  # Fsync the file, so we update the inode item in the log tree with the
  # new file size (64Kb). This also ends up setting the number of bytes
  # for the first prealloc extent to 64Kb. This is done by the truncation
  # at btrfs_log_prealloc_extents().
  # This means that if a power failure happens after this, a write into
  # the file range 64Kb to 128Kb will not use the prealloc extent and
  # will result in allocation of a new extent.
  $ xfs_io -c "fsync" /mnt/foo

  # Now set the file size to 256K with a truncate and then fsync the file.
  # Since no changes happened to the extents, the fsync only updates the
  # i_size in the inode item at the log tree. This results in an implicit
  # hole for the file range from 64Kb to 128Kb, something which fsck will
  # complain about when not using the NO_HOLES feature if we replay the
  # log after a power failure.
  $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo

So instead of always truncating the log to the inode's current i_size at btrfs_log_prealloc_extents(), check first if there's a prealloc extent that starts at an offset lower than the i_size and with a length that crosses the i_size - if there is one, just make sure we truncate to a size that corresponds to the end offset of that prealloc extent, so that we don't lose the part of that extent that starts at i_size if a power failure happens. A test case for fstests follows soon. Fixes: 31d11b83b96f ("Btrfs: fix duplicate extents after fsync of file with prealloc extents") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
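The rule for picking the truncation point can be written down as a small helper. This is a sketch of the rule as stated in the message, not the code in btrfs_log_prealloc_extents(): `struct prealloc_extent` and `log_truncate_offset()` are hypothetical, and the extents are passed in as a plain array rather than read from the log tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one prealloc extent of the file. */
struct prealloc_extent {
	uint64_t start; /* file offset where the extent begins */
	uint64_t len;   /* length in bytes */
};

/*
 * Return the offset the log should be truncated to: the end of a prealloc
 * extent that starts below i_size and crosses it, or i_size itself when
 * no such extent exists.
 */
static inline uint64_t log_truncate_offset(uint64_t i_size,
					   const struct prealloc_extent *ext,
					   size_t count)
{
	for (size_t i = 0; i < count; i++) {
		uint64_t end = ext[i].start + ext[i].len;

		if (ext[i].start < i_size && end > i_size)
			return end;
	}
	return i_size;
}
```

For the example above (two 128K extents, i_size of 64K) the helper returns 128K, so the tail of the first extent between 64K and 128K would be kept rather than dropped.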
2020-04-23btrfs: fix transaction leak in btrfs_recover_relocationXiyu Yang
btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins a btrfs_trans_handle object into transactions and returns a reference of it with increased refcount to "trans". When btrfs_recover_relocation() returns, "trans" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one exception handling path of btrfs_recover_relocation(). When read_fs_root() failed, the refcnt increased by btrfs_join_transaction() is not decreased, causing a refcnt leak. Fix this issue by calling btrfs_end_transaction() on this error path when read_fs_root() failed. Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-23btrfs: fix block group leak when removing failsXiyu Yang
btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which returns a local reference of the block group that contains the given bytenr to "block_group" with increased refcount. When btrfs_remove_block_group() returns, "block_group" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in several exception handling paths of btrfs_remove_block_group(). When those error scenarios occur such as btrfs_alloc_path() returns NULL, the function forgets to decrease its refcnt increased by btrfs_lookup_block_group() and will cause a refcnt leak. Fix this issue by jumping to "out_put_group" label and calling btrfs_put_block_group() when those error scenarios occur. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
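The shape of the fix, a single exit label that drops the reference on every path, can be shown with a toy reference counter. The names below (`struct group`, `group_get()`, `group_put()`, `remove_group()`) are hypothetical and the allocation failure is simulated with a flag; only the `out_put_group` label name is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a block group. */
struct group {
	int refs;
};

static inline struct group *group_get(struct group *g)
{
	g->refs++;
	return g;
}

static inline void group_put(struct group *g)
{
	g->refs--;
}

/* Takes a reference for the duration of the call and always drops it. */
static inline int remove_group(struct group *g, bool alloc_fails)
{
	struct group *bg = group_get(g);
	void *path;
	int ret = 0;

	path = alloc_fails ? NULL : malloc(16);
	if (!path) {
		ret = -ENOMEM;
		goto out_put_group; /* error path must not skip the put */
	}

	/* ... the real removal work would go here ... */
	free(path);

out_put_group:
	group_put(bg);
	return ret;
}
```

Returning directly from the `!path` branch instead of jumping to the label would leave `refs` one too high, which is the leak the patch describes.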
2020-04-23btrfs: drop logs when we've aborted a transactionJosef Bacik
Dave reported a problem where we were panicking with generic/475 with misc-5.7. This is because we were doing IO after we had stopped all of the worker threads, because we do the log tree cleanup on roots at drop time. Cleaning up the log tree will always need to do reads if we happened to have evicted the blocks from memory. Because of this simply add a helper to btrfs_cleanup_transaction() that will go through and drop all of the log roots. This gets run before we do the close_ctree() work, and thus we are allowed to do any reads that we would need. I ran this through many iterations of generic/475 with constrained memory and I did not see the issue.

  general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 12359 Comm: umount Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs]
  RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460
  RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0
  R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
  R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000
  FS: 00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_end_bio+0x81/0x130 [btrfs]
   __split_and_process_bio+0xaf/0x4e0 [dm_mod]
   ? percpu_counter_add_batch+0xa3/0x120
   dm_process_bio+0x98/0x290 [dm_mod]
   ? generic_make_request+0xfb/0x410
   dm_make_request+0x4d/0x120 [dm_mod]
   ? generic_make_request+0xfb/0x410
   generic_make_request+0x12a/0x410
   ? submit_bio+0x38/0x160
   submit_bio+0x38/0x160
   ? percpu_counter_add_batch+0xa3/0x120
   btrfs_map_bio+0x289/0x570 [btrfs]
   ? kmem_cache_alloc+0x24d/0x300
   btree_submit_bio_hook+0x79/0xc0 [btrfs]
   submit_one_bio+0x31/0x50 [btrfs]
   read_extent_buffer_pages+0x2fe/0x450 [btrfs]
   btree_read_extent_buffer_pages+0x7e/0x170 [btrfs]
   walk_down_log_tree+0x343/0x690 [btrfs]
   ? walk_log_tree+0x3d/0x380 [btrfs]
   walk_log_tree+0xf7/0x380 [btrfs]
   ? plist_requeue+0xf0/0xf0
   ? delete_node+0x4b/0x230
   free_log_tree+0x4c/0x130 [btrfs]
   ? wait_log_commit+0x140/0x140 [btrfs]
   btrfs_free_log+0x17/0x30 [btrfs]
   btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs]
   btrfs_free_fs_roots+0x10c/0x190 [btrfs]
   ? do_raw_spin_unlock+0x49/0xc0
   ? _raw_spin_unlock+0x29/0x40
   ? release_extent_buffer+0x121/0x170 [btrfs]
   close_ctree+0x289/0x2e6 [btrfs]
   generic_shutdown_super+0x6c/0x110
   kill_anon_super+0xe/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x3a/0x70

Reported-by: David Sterba <dsterba@suse.com> Fixes: 8c38938c7bb096 ("btrfs: move the root freeing stuff into btrfs_put_root") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-23  btrfs: fix memory leak of transaction when deleting unused block group  Filipe Manana

When cleaning pinned extents right before deleting an unused block group, we check if there's still a previous transaction running and if so we increment its reference count before using it for cleaning pinned ranges in its pinned extents iotree. However we ended up never decrementing the reference count after using the transaction, resulting in a memory leak.

Fix it by decrementing the reference count.

Fixes: fe119a6eeb6705 ("btrfs: switch to per-transaction pinned extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
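The leak described above follows the usual "get without put" pattern. The following is a minimal userspace sketch of that pattern, not the kernel code: `toy_transaction`, `toy_get`, `toy_put` and the two `clean_pinned_*` helpers are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transaction object; not the kernel struct. */
struct toy_transaction {
	int use_count;	/* number of outstanding references */
	bool freed;	/* set once the last reference is dropped */
};

void toy_get(struct toy_transaction *t)
{
	t->use_count++;
}

void toy_put(struct toy_transaction *t)
{
	if (--t->use_count == 0)
		t->freed = true;
}

/* Pre-fix shape: the reference taken for the cleanup is never dropped. */
void clean_pinned_leaky(struct toy_transaction *prev)
{
	toy_get(prev);
	/* ... use prev to clean pinned ranges ... */
}

/* Post-fix shape: every get is paired with a put. */
void clean_pinned_fixed(struct toy_transaction *prev)
{
	toy_get(prev);
	/* ... use prev to clean pinned ranges ... */
	toy_put(prev);
}

/* Reports whether the object is freed once its owner drops its own reference. */
bool freed_after_owner_put(void (*clean)(struct toy_transaction *))
{
	struct toy_transaction t = { .use_count = 1, .freed = false };

	clean(&t);
	toy_put(&t);	/* the owner's reference */
	return t.freed;
}
```

With the leaky helper the count never reaches zero after the owner's put, so the object is never freed; with the fixed helper it is.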
2020-04-20  btrfs: discard: Use the correct style for SPDX License Identifier  Nishad Kamdar

This patch corrects the SPDX License Identifier style in a header file related to Btrfs File System support. For C header files, Documentation/process/license-rules.rst mandates C-like comments (as opposed to C source files, where C++ style should be used).

Changes made by using a script provided by Joe Perches here: https://lkml.org/lkml/2019/2/7/46.

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-17  btrfs: fix setting last_trans for reloc roots  Josef Bacik

I made a mistake with my previous fix: I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents.

The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again.

This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans did not equal the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root.

Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel:

  BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
  CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
  RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
  RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
  RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
  RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
  R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
  R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
  FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? create_reloc_root+0x49/0x2b0 [btrfs]
   ? kmem_cache_alloc_trace+0xe5/0x200
   create_reloc_root+0x8b/0x2b0 [btrfs]
   btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
   create_pending_snapshot+0x610/0x1010 [btrfs]
   create_pending_snapshots+0xa8/0xd0 [btrfs]
   btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
   ? btrfs_mksubvol+0x3cd/0x560 [btrfs]
   btrfs_mksubvol+0x455/0x560 [btrfs]
   __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
   btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
   ? mem_cgroup_commit_charge+0x6e/0x540
   btrfs_ioctl+0x12d8/0x3760 [btrfs]
   ? do_raw_spin_unlock+0x49/0xc0
   ? _raw_spin_unlock+0x29/0x40
   ? __handle_mm_fault+0x11b3/0x14b0
   ? ksys_ioctl+0x92/0xb0
   ksys_ioctl+0x92/0xb0
   ? trace_hardirqs_off_thunk+0x1a/0x1c
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x5c/0x280
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fdeabd3bdd7

Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-08  btrfs: fix reclaim counter leak of space_info objects  Filipe Manana

Whenever we add a ticket to a space_info object we increment the object's reclaim_size counter with the ticket's bytes, and we decrement it with the corresponding amount only when we are able to grant the requested space to the ticket. When we are not able to grant the space to a ticket, or when the ticket is removed due to a signal (e.g. an application has received SIGTERM from the terminal), we never decrement the counter with the corresponding bytes from the ticket. This leak can cause the space reclaim code to later do much more work than necessary. So fix it by decrementing the counter when those two cases happen as well.

Fixes: db161806dc5615 ("btrfs: account ticket size at add/delete time")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
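The accounting rule in this message can be illustrated with a small userspace model. This is only a sketch under assumptions: the `toy_*` types and functions are hypothetical and do not mirror the real btrfs structures beyond borrowing the `reclaim_size` and `bytes` field names from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for space_info and a reserve ticket. */
struct toy_space_info {
	uint64_t reclaim_size;	/* total bytes wanted by queued tickets */
};

struct toy_ticket {
	uint64_t bytes;		/* bytes this ticket still asks for */
};

/* Queueing a ticket adds its bytes to the counter. */
void toy_add_ticket(struct toy_space_info *si, const struct toy_ticket *t)
{
	si->reclaim_size += t->bytes;
}

/* Granting a ticket removes its bytes from the counter. */
void toy_grant_ticket(struct toy_space_info *si, struct toy_ticket *t)
{
	si->reclaim_size -= t->bytes;
	t->bytes = 0;
}

/* Pre-fix shape: a failed or interrupted ticket leaves its bytes behind. */
void toy_remove_ticket_leaky(struct toy_space_info *si, struct toy_ticket *t)
{
	(void)si;
	t->bytes = 0;
}

/* Post-fix shape: removal decrements the counter just like a grant does. */
void toy_remove_ticket_fixed(struct toy_space_info *si, struct toy_ticket *t)
{
	si->reclaim_size -= t->bytes;
	t->bytes = 0;
}
```

In this model, a ticket that is added and then removed through the leaky path leaves `reclaim_size` permanently inflated by the ticket's size, which is the extra reclaim work the message refers to.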
2020-04-08  btrfs: make full fsyncs always operate on the entire file again  Filipe Manana

This is a revert of commit 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient"), with an updated comment in btrfs_sync_file.

Commit 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient") made full fsyncs operate on the given range only as it assumed it was safe when using the NO_HOLES feature, since the hole detection was simplified some time ago and was no longer a source of races with ordered extent completion of adjacent file ranges.

However it's still not safe to have a full fsync only operate on the given range, because extent maps for new extents might not be present in memory due to inode eviction or extent cloning.

Consider the following example:

1) We are currently at transaction N;

2) We write to the file range [0, 1MiB);

3) Writeback finishes for the whole range and ordered extents complete, while we are still at transaction N;

4) The inode is evicted;

5) We open the file for writing, causing the inode to be loaded to memory again, which sets the 'full sync' bit on its flags. At this point the inode's list of modified extent maps is empty (figuring out which extents were created in the current transaction and were not yet logged by an fsync is expensive, that's why we set the 'full sync' bit when loading an inode);

6) We write to the file range [512KiB, 768KiB);

7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB). This correctly flushes this range and logs its extent into the log tree. When the writeback started an extent map for range [512KiB, 768KiB) was added to the inode's list of modified extents, and when the fsync() finishes logging it removes that extent map from the list of modified extent maps. This fsync also clears the 'full sync' bit;

8) We do a regular fsync() (full ranged). This fsync() ends up doing nothing because the inode's list of modified extents is empty and no other changes happened since the previous ranged fsync(), so it just returns success (0) and we end up never logging extents for the file ranges [0, 512KiB) and [768KiB, 1MiB).

Another scenario where this can happen is if we replace steps 2 to 4 with cloning from another file into our test file, as that sets the 'full sync' bit in our inode's flags and does not populate its list of modified extent maps.

This was causing test case generic/457 to fail sporadically when using the NO_HOLES feature, as it exercised this latter case where the inode has the 'full sync' bit set and has no extent maps in memory to represent the new extents due to extent cloning.

Fix this by reverting commit 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient") since there is no easy way to work around it.

Fixes: 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
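The eight-step example above can be traced with a toy state machine. This is a deliberately simplified sketch, not btrfs code: the 1MiB file is modeled as four 256KiB chunks, and `toy_inode`, `toy_write`, `toy_evict_and_reload`, `toy_fsync` and `toy_run_scenario` are hypothetical names. The `ranged_full` flag selects between the reverted behavior (a full fsync restricted to the given range) and the restored behavior (a full fsync always covers the whole file).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_CHUNKS 4	/* 1MiB file modeled as four 256KiB chunks */

struct toy_inode {
	bool full_sync;			/* the 'full sync' bit */
	bool unlogged[TOY_CHUNKS];	/* new extent not yet in the log tree */
	bool modified[TOY_CHUNKS];	/* in-memory list of modified extent maps */
};

void toy_write(struct toy_inode *ino, int first, int last)
{
	for (int i = first; i <= last; i++) {
		ino->unlogged[i] = true;
		ino->modified[i] = true;
	}
}

/* Eviction drops the in-memory extent maps; reloading sets 'full sync'. */
void toy_evict_and_reload(struct toy_inode *ino)
{
	memset(ino->modified, 0, sizeof(ino->modified));
	ino->full_sync = true;
}

void toy_fsync(struct toy_inode *ino, int first, int last, bool ranged_full)
{
	if (ino->full_sync) {
		/* A full fsync logs every new extent it covers. */
		if (!ranged_full) {
			first = 0;
			last = TOY_CHUNKS - 1;
		}
		for (int i = first; i <= last; i++) {
			ino->unlogged[i] = false;
			ino->modified[i] = false;
		}
		ino->full_sync = false;
		return;
	}
	/* A fast fsync only logs what is on the modified list. */
	for (int i = first; i <= last; i++) {
		if (ino->modified[i]) {
			ino->unlogged[i] = false;
			ino->modified[i] = false;
		}
	}
}

/* Runs steps 2 to 8 and returns how many chunks were never logged. */
int toy_run_scenario(bool ranged_full)
{
	struct toy_inode ino = { 0 };
	int lost = 0;

	toy_write(&ino, 0, 3);			/* step 2 */
	toy_evict_and_reload(&ino);		/* steps 3 to 5 */
	toy_write(&ino, 2, 2);			/* step 6: [512KiB, 768KiB) */
	toy_fsync(&ino, 2, 2, ranged_full);	/* step 7: ranged fsync */
	toy_fsync(&ino, 0, 3, ranged_full);	/* step 8: regular fsync */

	for (int i = 0; i < TOY_CHUNKS; i++)
		lost += ino.unlogged[i];
	return lost;
}
```

In this model, `toy_run_scenario(true)` leaves three chunks unlogged, corresponding to [0, 512KiB) and [768KiB, 1MiB), while `toy_run_scenario(false)` leaves none.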