From 9fecd13202f520f3f25d5b1c313adb740fe19773 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 1 Jun 2020 19:12:06 +0100 Subject: btrfs: fix a block group ref counter leak after failure to remove block group When removing a block group, if we fail to delete the block group's item from the extent tree, we jump to the 'out' label and end up decrementing the block group's reference count once only (by 1), resulting in a counter leak because the block group at that point was already removed from the block group cache rbtree - so we have to decrement the reference count twice, once for the rbtree and once for our lookup at the start of the function. There is a second bug where if removing the free space tree entries (the call to remove_block_group_free_space()) fails we end up jumping to the 'out_put_group' label but end up decrementing the reference count only once, when we should have done it twice, since we have already removed the block group from the block group cache rbtree. This happens because the reference count decrement for the rbtree reference happens after attempting to remove the free space tree entries, which is far away from the place where we remove the block group from the rbtree. To make things less error prone, decrement the reference count for the rbtree immediately after removing the block group from it. This also eleminates the need for two different exit labels on error, renaming 'out_put_label' to just 'out' and removing the old 'out'. Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov Reviewed-by: Anand Jain Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 176e8a292fd1..6462dd0b155c 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -940,7 +940,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; - goto out_put_group; + goto out; } /* @@ -978,7 +978,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, ret = btrfs_orphan_add(trans, BTRFS_I(inode)); if (ret) { btrfs_add_delayed_iput(inode); - goto out_put_group; + goto out; } clear_nlink(inode); /* One for the block groups ref */ @@ -1001,13 +1001,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1); if (ret < 0) - goto out_put_group; + goto out; if (ret > 0) btrfs_release_path(path); if (ret == 0) { ret = btrfs_del_item(trans, tree_root, path); if (ret) - goto out_put_group; + goto out; btrfs_release_path(path); } @@ -1016,6 +1016,9 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, &fs_info->block_group_cache_tree); RB_CLEAR_NODE(&block_group->cache_node); + /* Once for the block groups rbtree */ + btrfs_put_block_group(block_group); + if (fs_info->first_logical_byte == block_group->start) fs_info->first_logical_byte = (u64)-1; spin_unlock(&fs_info->block_group_cache_lock); @@ -1125,10 +1128,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, ret = remove_block_group_free_space(trans, block_group); if (ret) - goto out_put_group; - - /* Once for the block groups rbtree */ - btrfs_put_block_group(block_group); + goto out; ret = remove_block_group_item(trans, path, block_group); if (ret < 0) @@ -1145,10 +1145,9 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, free_extent_map(em); } -out_put_group: +out: /* Once for the lookup reference */ btrfs_put_block_group(block_group); -out: if (remove_rsv) btrfs_delayed_refs_rsv_release(fs_info, 1); btrfs_free_path(path); -- cgit From ffcb9d44572afbaf8fa6dbf5115bff6dab7b299e Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 1 Jun 2020 19:12:19 +0100 Subject: btrfs: fix race between block group removal and block group creation There is a race between block group removal and block group creation when the removal is completed by a task running fitrim or scrub. When this happens we end up failing the block group creation with an error -EEXIST since we attempt to insert a duplicate block group item key in the extent tree. That results in a transaction abort. The race happens like this: 1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes block group X with btrfs_freeze_block_group() (until very recently that was named btrfs_get_block_group_trimming()); 2) Task B starts removing block group X, either because it's now unused or due to relocation for example. So at btrfs_remove_block_group(), while holding the chunk mutex and the block group's lock, it sets the 'removed' flag of the block group and it sets the local variable 'remove_em' to false, because the block group is currently frozen (its 'frozen' counter is > 0, until very recently this counter was named 'trimming'); 3) Task B unlocks the block group and the chunk mutex; 4) Task A is done trimming the block group and unfreezes the block group by calling btrfs_unfreeze_block_group() (until very recently this was named btrfs_put_block_group_trimming()). In this function we lock the block group and set the local variable 'cleanup' to true because we were able to decrement the block group's 'frozen' counter down to 0 and the flag 'removed' is set in the block group. Since 'cleanup' is set to true, it locks the chunk mutex and removes the extent mapping representing the block group from the mapping tree; 5) Task C allocates a new block group Y and it picks up the logical address that block group X had as the logical address for Y, because X was the block group with the highest logical address and now the second block group with the highest logical address, the last in the fs mapping tree, ends at an offset corresponding to block group X's logical address (this logical address selection is done at volumes.c:find_next_chunk()). At this point the new block group Y does not have yet its item added to the extent tree (nor the corresponding device extent items and chunk item in the device and chunk trees). The new group Y is added to the list of pending block groups in the transaction handle; 6) Before task B proceeds to removing the block group item for block group X from the extent tree, which has a key matching: (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length) task C while ending its transaction handle calls btrfs_create_pending_block_groups(), which finds block group Y and tries to insert the block group item for Y into the exten tree, which fails with -EEXIST since logical offset is the same that X had and task B hasn't yet deleted the key from the extent tree. This failure results in a transaction abort, producing a stack like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Code: ff ff ff 48 8b 55 50 f0 48 (...) RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001 RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001 R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10 R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000 FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs] btrfs_commit_transaction+0xd0/0xc50 [btrfs] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 iterate_supers+0xdb/0x180 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f2ce1d4d5b7 Code: 83 c4 08 48 3d 01 (...) RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7 RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00 R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032 R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [] copy_process+0x74f/0x2020 softirqs last enabled at (0): [] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a9c ]--- Fix this simply by making btrfs_remove_block_group() remove the block group's item from the extent tree before it flags the block group as removed. Also make the free space deletion from the free space tree before flagging the block group as removed, to avoid a similar race with adding and removing free space entries for the free space tree. Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 6462dd0b155c..c037ef514b64 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1092,6 +1092,25 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, spin_unlock(&block_group->space_info->lock); + /* + * Remove the free space for the block group from the free space tree + * and the block group's item from the extent tree before marking the + * block group as removed. This is to prevent races with tasks that + * freeze and unfreeze a block group, this task and another task + * allocating a new block group - the unfreeze task ends up removing + * the block group's extent map before the task calling this function + * deletes the block group item from the extent tree, allowing for + * another task to attempt to create another block group with the same + * item key (and failing with -EEXIST and a transaction abort). + */ + ret = remove_block_group_free_space(trans, block_group); + if (ret) + goto out; + + ret = remove_block_group_item(trans, path, block_group); + if (ret < 0) + goto out; + mutex_lock(&fs_info->chunk_mutex); spin_lock(&block_group->lock); block_group->removed = 1; @@ -1126,14 +1145,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, mutex_unlock(&fs_info->chunk_mutex); - ret = remove_block_group_free_space(trans, block_group); - if (ret) - goto out; - - ret = remove_block_group_item(trans, path, block_group); - if (ret < 0) - goto out; - if (remove_em) { struct extent_map_tree *em_tree; -- cgit From 9e22b925985e71c6acf0dba03f9b99a56806a137 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Fri, 3 Apr 2020 16:40:34 +0300 Subject: btrfs: read stripe len directly in btrfs_rmap_block extent_map::orig_block_len contains the size of a physical stripe when it's used to describe block groups (calculated in read_one_chunk via calc_stripe_length or calculated in decide_stripe_size and then assigned to extent_map::orig_block_len in create_chunk). Exploit this fact to get the size directly rather than opencoding the calculations. No functional changes. Signed-off-by: Nikolay Borisov Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index c037ef514b64..956ae6283d33 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1657,19 +1657,12 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, return -EIO; map = em->map_lookup; - data_stripe_length = em->len; + data_stripe_length = em->orig_block_len; io_stripe_size = map->stripe_len; - if (map->type & BTRFS_BLOCK_GROUP_RAID10) - data_stripe_length = div_u64(data_stripe_length, - map->num_stripes / map->sub_stripes); - else if (map->type & BTRFS_BLOCK_GROUP_RAID0) - data_stripe_length = div_u64(data_stripe_length, map->num_stripes); - else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) { - data_stripe_length = div_u64(data_stripe_length, - nr_data_stripes(map)); + /* For RAID5/6 adjust to a full IO stripe length */ + if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) io_stripe_size = map->stripe_len * nr_data_stripes(map); - } buf = kcalloc(map->num_stripes, sizeof(u64), GFP_NOFS); if (!buf) { -- cgit From 96f9b0f2fa01c96e90abd4e981208a341a5da66e Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Fri, 3 Apr 2020 16:40:35 +0300 Subject: btrfs: simplify checks when adding excluded ranges Adresses held in 'logical' array are always guaranteed to fall within the boundaries of the block group. That is, 'start' can never be smaller than cache->start. This invariant follows from the way the address are calculated in btrfs_rmap_block: stripe_nr = physical - map->stripes[i].physical; stripe_nr = div64_u64(stripe_nr, map->stripe_len); bytenr = chunk_start + stripe_nr * io_stripe_size; I.e it's always some IO stripe within the given chunk. Exploit this invariant to simplify the body of the loop by removing the unnecessary 'if' since its 'else' part is the one always executed. Signed-off-by: Nikolay Borisov Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 21 ++++----------------- 1 file changed, 4 insertions(+), 17 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 956ae6283d33..e29e9629b246 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1741,25 +1741,12 @@ static int exclude_super_stripes(struct btrfs_block_group *cache) return ret; while (nr--) { - u64 start, len; - - if (logical[nr] > cache->start + cache->length) - continue; - - if (logical[nr] + stripe_len <= cache->start) - continue; - - start = logical[nr]; - if (start < cache->start) { - start = cache->start; - len = (logical[nr] + stripe_len) - start; - } else { - len = min_t(u64, stripe_len, - cache->start + cache->length - start); - } + u64 len = min_t(u64, stripe_len, + cache->start + cache->length - logical[nr]); cache->bytes_super += len; - ret = btrfs_add_excluded_extent(fs_info, start, len); + ret = btrfs_add_excluded_extent(fs_info, logical[nr], + len); if (ret) { kfree(logical); return ret; -- cgit From 89d7da9bc592aa6a341d00f2d949615a89bb1eb7 Mon Sep 17 00:00:00 2001 From: Johannes Thumshirn Date: Tue, 2 Jun 2020 19:05:56 +0900 Subject: btrfs: get mapping tree directly from fsinfo in find_first_block_group We already have an fs_info in our function parameters, there's no need to do the maths again and get fs_info from the extent_root just to get the mapping_tree. Instead directly grab the mapping_tree from fs_info. Reviewed-by: Nikolay Borisov Signed-off-by: Johannes Thumshirn Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index e29e9629b246..bfb5f4258336 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1566,7 +1566,7 @@ static int find_first_block_group(struct btrfs_fs_info *fs_info, struct extent_map_tree *em_tree; struct extent_map *em; - em_tree = &root->fs_info->mapping_tree; + em_tree = &fs_info->mapping_tree; read_lock(&em_tree->lock); em = lookup_extent_mapping(em_tree, found_key.objectid, found_key.offset); -- cgit From e3ba67a108ff5a2657990014d3ed4488bd665be6 Mon Sep 17 00:00:00 2001 From: Johannes Thumshirn Date: Tue, 2 Jun 2020 19:05:57 +0900 Subject: btrfs: factor out reading of bg from find_frist_block_group When find_first_block_group() finds a block group item in the extent-tree, it does a lookup of the object in the extent mapping tree and does further checks on the item. Factor out this step from find_first_block_group() so we can further simplify the code. While we're at it, we can also just return early in find_first_block_group(), if the tree slot isn't found. Signed-off-by: Johannes Thumshirn Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 102 +++++++++++++++++++++++++++---------------------- 1 file changed, 56 insertions(+), 46 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index bfb5f4258336..11244b67434e 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1532,21 +1532,70 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg) spin_unlock(&fs_info->unused_bgs_lock); } +static int read_bg_from_eb(struct btrfs_fs_info *fs_info, struct btrfs_key *key, + struct btrfs_path *path) +{ + struct extent_map_tree *em_tree; + struct extent_map *em; + struct btrfs_block_group_item bg; + struct extent_buffer *leaf; + int slot; + u64 flags; + int ret = 0; + + slot = path->slots[0]; + leaf = path->nodes[0]; + + em_tree = &fs_info->mapping_tree; + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, key->objectid, key->offset); + read_unlock(&em_tree->lock); + if (!em) { + btrfs_err(fs_info, + "logical %llu len %llu found bg but no related chunk", + key->objectid, key->offset); + return -ENOENT; + } + + if (em->start != key->objectid || em->len != key->offset) { + btrfs_err(fs_info, + "block group %llu len %llu mismatch with chunk %llu len %llu", + key->objectid, key->offset, em->start, em->len); + ret = -EUCLEAN; + goto out_free_em; + } + + read_extent_buffer(leaf, &bg, btrfs_item_ptr_offset(leaf, slot), + sizeof(bg)); + flags = btrfs_stack_block_group_flags(&bg) & + BTRFS_BLOCK_GROUP_TYPE_MASK; + + if (flags != (em->map_lookup->type & BTRFS_BLOCK_GROUP_TYPE_MASK)) { + btrfs_err(fs_info, +"block group %llu len %llu type flags 0x%llx mismatch with chunk type flags 0x%llx", + key->objectid, key->offset, flags, + (BTRFS_BLOCK_GROUP_TYPE_MASK & em->map_lookup->type)); + ret = -EUCLEAN; + } + +out_free_em: + free_extent_map(em); + return ret; +} + static int find_first_block_group(struct btrfs_fs_info *fs_info, struct btrfs_path *path, struct btrfs_key *key) { struct btrfs_root *root = fs_info->extent_root; - int ret = 0; + int ret; struct btrfs_key found_key; struct extent_buffer *leaf; - struct btrfs_block_group_item bg; - u64 flags; int slot; ret = btrfs_search_slot(NULL, root, key, path, 0, 0); if (ret < 0) - goto out; + return ret; while (1) { slot = path->slots[0]; @@ -1563,49 +1612,10 @@ static int find_first_block_group(struct btrfs_fs_info *fs_info, if (found_key.objectid >= key->objectid && found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) { - struct extent_map_tree *em_tree; - struct extent_map *em; - - em_tree = &fs_info->mapping_tree; - read_lock(&em_tree->lock); - em = lookup_extent_mapping(em_tree, found_key.objectid, - found_key.offset); - read_unlock(&em_tree->lock); - if (!em) { - btrfs_err(fs_info, - "logical %llu len %llu found bg but no related chunk", - found_key.objectid, found_key.offset); - ret = -ENOENT; - } else if (em->start != found_key.objectid || - em->len != found_key.offset) { - btrfs_err(fs_info, - "block group %llu len %llu mismatch with chunk %llu len %llu", - found_key.objectid, found_key.offset, - em->start, em->len); - ret = -EUCLEAN; - } else { - read_extent_buffer(leaf, &bg, - btrfs_item_ptr_offset(leaf, slot), - sizeof(bg)); - flags = btrfs_stack_block_group_flags(&bg) & - BTRFS_BLOCK_GROUP_TYPE_MASK; - - if (flags != (em->map_lookup->type & - BTRFS_BLOCK_GROUP_TYPE_MASK)) { - btrfs_err(fs_info, -"block group %llu len %llu type flags 0x%llx mismatch with chunk type flags 0x%llx", - found_key.objectid, - found_key.offset, flags, - (BTRFS_BLOCK_GROUP_TYPE_MASK & - em->map_lookup->type)); - ret = -EUCLEAN; - } else { - ret = 0; - } - } - free_extent_map(em); - goto out; + ret = read_bg_from_eb(fs_info, &found_key, path); + break; } + path->slots[0]++; } out: -- cgit From f22f457a1a58b95c22a870f5e45b3fd7b23ebe7b Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 1 Jun 2020 19:12:27 +0100 Subject: btrfs: remove no longer necessary chunk mutex locking cases Initially when the 'removed' flag was added to a block group to avoid races between block group removal and fitrim, by commit 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation"), we had to lock the chunks mutex because we could be moving the block group from its current list, the pending chunks list, into the pinned chunks list, or we could just be adding it to the pinned chunks if it was not in the pending chunks list. Both lists were protected by the chunk mutex. However we no longer have those lists since commit 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree"), and locking the chunk mutex is no longer necessary because of that. The same happens at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block group's extent map could be part of the pinned chunks list and the call to remove_extent_mapping() could be deleting it from that list, which used to be protected by that mutex. So just remove those lock and unlock calls as they are not needed anymore. Reviewed-by: Nikolay Borisov Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 11244b67434e..31ca2cfb7e3e 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1111,7 +1111,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, if (ret < 0) goto out; - mutex_lock(&fs_info->chunk_mutex); spin_lock(&block_group->lock); block_group->removed = 1; /* @@ -1143,8 +1142,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, remove_em = (atomic_read(&block_group->frozen) == 0); spin_unlock(&block_group->lock); - mutex_unlock(&fs_info->chunk_mutex); - if (remove_em) { struct extent_map_tree *em_tree; @@ -3437,7 +3434,6 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group) spin_unlock(&block_group->lock); if (cleanup) { - mutex_lock(&fs_info->chunk_mutex); em_tree = &fs_info->mapping_tree; write_lock(&em_tree->lock); em = lookup_extent_mapping(em_tree, block_group->start, @@ -3445,7 +3441,6 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group) BUG_ON(!em); /* logic error, can't happen */ remove_extent_mapping(em_tree, em); write_unlock(&em_tree->lock); - mutex_unlock(&fs_info->chunk_mutex); /* once for us and once for the tree */ free_extent_map(em); -- cgit From 36ea6f3e931391c2adbb38af8c5dd4a043d26ac5 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:41 +0300 Subject: btrfs: make btrfs_check_data_free_space take btrfs_inode Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode directly. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 31ca2cfb7e3e..6ade9d345e66 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -2508,7 +2508,8 @@ again: num_pages *= 16; num_pages *= PAGE_SIZE; - ret = btrfs_check_data_free_space(inode, &data_reserved, 0, num_pages); + ret = btrfs_check_data_free_space(BTRFS_I(inode), &data_reserved, 0, + num_pages); if (ret) goto out_put; -- cgit From 48aaeebe4e1f3c76a703dc40e9245af1753dcefe Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Mon, 6 Jul 2020 09:14:11 -0400 Subject: btrfs: convert block group refcount to refcount_t We have refcount_t now with the associated library to handle refcounts, which gives us extra debugging around reference count mistakes that may be made. For example it'll warn on any transition from 0->1 or 0->-1, which is handy for noticing cases where we've messed up reference counting. Convert the block group ref counting from an atomic_t to refcount_t and use the appropriate helpers. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 6ade9d345e66..884de28a41e4 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -118,12 +118,12 @@ u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags) void btrfs_get_block_group(struct btrfs_block_group *cache) { - atomic_inc(&cache->count); + refcount_inc(&cache->refs); } void btrfs_put_block_group(struct btrfs_block_group *cache) { - if (atomic_dec_and_test(&cache->count)) { + if (refcount_dec_and_test(&cache->refs)) { WARN_ON(cache->pinned > 0); WARN_ON(cache->reserved > 0); @@ -1805,7 +1805,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache( cache->discard_index = BTRFS_DISCARD_INDEX_UNUSED; - atomic_set(&cache->count, 1); + refcount_set(&cache->refs, 1); spin_lock_init(&cache->lock); init_rwsem(&cache->data_rwsem); INIT_LIST_HEAD(&cache->list); @@ -3380,7 +3380,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) ASSERT(list_empty(&block_group->dirty_list)); ASSERT(list_empty(&block_group->io_list)); ASSERT(list_empty(&block_group->bg_list)); - ASSERT(atomic_read(&block_group->count) == 1); + ASSERT(refcount_read(&block_group->refs) == 1); btrfs_put_block_group(block_group); spin_lock(&info->block_group_cache_lock); -- cgit From 349e120ecebeb984376c8edb89bf0bfb85bc16f7 Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Tue, 21 Jul 2020 10:48:45 -0400 Subject: btrfs: don't adjust bg flags and use default allocation profiles MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit btrfs/061 has been failing consistently for me recently with a transaction abort. We run out of space in the system chunk array, which means we've allocated way too many system chunks than we need. Chris added this a long time ago for balance as a poor mans restriping. If you had a single disk and then added another disk and then did a balance, update_block_group_flags would then figure out which RAID level you needed. Fast forward to today and we have restriping behavior, so we can explicitly tell the fs that we're trying to change the raid level. This is accomplished through the normal get_alloc_profile path. Furthermore this code actually causes btrfs/061 to fail, because we do things like mkfs -m dup -d single with multiple devices. This trips this check alloc_flags = update_block_group_flags(fs_info, cache->flags); if (alloc_flags != cache->flags) { ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE); in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but not actually restriping, we keep forcing chunk allocation of RAID1 chunks. This eventually causes us to run out of system space and the file system aborts and flips read only. We don't need this poor mans restriping any more, simply use the normal get_alloc_profile helper, which will get the correct alloc_flags and thus make the right decision for chunk allocation. This keeps us from allocating a billion system chunks and falling over. Tested-by: Holger Hoffstätte Reviewed-by: Qu Wenruo Signed-off-by: Josef Bacik Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 52 ++------------------------------------------------ 1 file changed, 2 insertions(+), 50 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 884de28a41e4..652b35d5a773 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -2194,54 +2194,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, return 0; } -static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags) -{ - u64 num_devices; - u64 stripped; - - /* - * if restripe for this chunk_type is on pick target profile and - * return, otherwise do the usual balance - */ - stripped = get_restripe_target(fs_info, flags); - if (stripped) - return extended_to_chunk(stripped); - - num_devices = fs_info->fs_devices->rw_devices; - - stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK | - BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10; - - if (num_devices == 1) { - stripped |= BTRFS_BLOCK_GROUP_DUP; - stripped = flags & ~stripped; - - /* turn raid0 into single device chunks */ - if (flags & BTRFS_BLOCK_GROUP_RAID0) - return stripped; - - /* turn mirroring into duplication */ - if (flags & (BTRFS_BLOCK_GROUP_RAID1_MASK | - BTRFS_BLOCK_GROUP_RAID10)) - return stripped | BTRFS_BLOCK_GROUP_DUP; - } else { - /* they already had raid on here, just return */ - if (flags & stripped) - return flags; - - stripped |= BTRFS_BLOCK_GROUP_DUP; - stripped = flags & ~stripped; - - /* switch duplicated blocks with raid1 */ - if (flags & BTRFS_BLOCK_GROUP_DUP) - return stripped | BTRFS_BLOCK_GROUP_RAID1; - - /* this is drive concat, leave it alone */ - } - - return flags; -} - /* * Mark one block group RO, can be called several times for the same block * group. @@ -2287,7 +2239,7 @@ again: * If we are changing raid levels, try to allocate a * corresponding block group with the new raid level. */ - alloc_flags = update_block_group_flags(fs_info, cache->flags); + alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags); if (alloc_flags != cache->flags) { ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE); @@ -2314,7 +2266,7 @@ again: ret = inc_block_group_ro(cache, 0); out: if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) { - alloc_flags = update_block_group_flags(fs_info, cache->flags); + alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags); mutex_lock(&fs_info->chunk_mutex); check_system_chunk(trans, alloc_flags); mutex_unlock(&fs_info->chunk_mutex); -- cgit From 162e0a16b7d93629c8325ea32d99b38d17bc03eb Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Tue, 21 Jul 2020 10:48:46 -0400 Subject: btrfs: if we're restriping, use the target restripe profile MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previously we depended on some weird behavior in our chunk allocator to force the allocation of new stripes, so by the time we got to doing the reduce we would usually already have a chunk with the proper target. However that behavior causes other problems and needs to be removed. First however we need to remove this check to only restripe if we already have those available profiles, because if we're allocating our first chunk it obviously will not be available. Simply use the target as specified, and if that fails it'll be because we're out of space. Tested-by: Holger Hoffstätte Signed-off-by: Josef Bacik Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'fs/btrfs/block-group.c') diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 652b35d5a773..613920c17ac1 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -65,11 +65,8 @@ static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags) spin_lock(&fs_info->balance_lock); target = get_restripe_target(fs_info, flags); if (target) { - /* Pick target profile only if it's already available */ - if ((flags & target) & BTRFS_EXTENDED_PROFILE_MASK) { - spin_unlock(&fs_info->balance_lock); - return extended_to_chunk(target); - } + spin_unlock(&fs_info->balance_lock); + return extended_to_chunk(target); } spin_unlock(&fs_info->balance_lock); -- cgit