From 9f47eb5461aaeb6cb8696f9d11503ae90e4d5cb0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 8 May 2020 14:15:37 -0700 Subject: fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls Very large I/Os can cause the following RCU CPU stall warning: RIP: 0010:rb_prev+0x8/0x50 Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c = 89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48 RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13 RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0 RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630 RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238 R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001 R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0 __lookup_extent_mapping+0xa0/0x110 try_release_extent_mapping+0xdc/0x220 btrfs_releasepage+0x45/0x70 shrink_page_list+0xa39/0xb30 shrink_inactive_list+0x18f/0x3b0 shrink_lruvec+0x38e/0x6b0 shrink_node+0x14d/0x690 do_try_to_free_pages+0xc6/0x3e0 try_to_free_mem_cgroup_pages+0xe6/0x1e0 reclaim_high.constprop.73+0x87/0xc0 mem_cgroup_handle_over_high+0x66/0x150 exit_to_usermode_loop+0x82/0xd0 do_syscall_64+0xd4/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 On a PREEMPT=n kernel, the try_release_extent_mapping() function's "while" loop might run for a very long time on a large I/O. This commit therefore adds a cond_resched() to this loop, providing RCU any needed quiescent states. Signed-off-by: Paul E. McKenney --- fs/btrfs/extent_io.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 68c96057ad2d..704239546093 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4515,6 +4515,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) /* once for us */ free_extent_map(em); + + cond_resched(); /* Allow large-extent preemption. */ } } return try_release_extent_state(tree, page, mask); -- cgit From 6bf9cd2eed9aee6d742bb9296c994a91f5316949 Mon Sep 17 00:00:00 2001 From: Boris Burkov Date: Wed, 17 Jun 2020 11:35:19 -0700 Subject: btrfs: fix fatal extent_buffer readahead vs releasepage race Under somewhat convoluted conditions, it is possible to attempt to release an extent_buffer that is under io, which triggers a BUG_ON in btrfs_release_extent_buffer_pages. This relies on a few different factors. First, extent_buffer reads done as readahead for searching use WAIT_NONE, so they free the local extent buffer reference while the io is outstanding. However, they should still be protected by TREE_REF. However, if the system is doing signficant reclaim, and simultaneously heavily accessing the extent_buffers, it is possible for releasepage to race with two concurrent readahead attempts in a way that leaves TREE_REF unset when the readahead extent buffer is released. Essentially, if two tasks race to allocate a new extent_buffer, but the winner who attempts the first io is rebuffed by a page being locked (likely by the reclaim itself) then the loser will still go ahead with issuing the readahead. The loser's call to find_extent_buffer must also race with the reclaim task reading the extent_buffer's refcount as 1 in a way that allows the reclaim to re-clear the TREE_REF checked by find_extent_buffer. The following represents an example execution demonstrating the race: CPU0 CPU1 CPU2 reada_for_search reada_for_search readahead_tree_block readahead_tree_block find_create_tree_block find_create_tree_block alloc_extent_buffer alloc_extent_buffer find_extent_buffer // not found allocates eb lock pages associate pages to eb insert eb into radix tree set TREE_REF, refs == 2 unlock pages read_extent_buffer_pages // WAIT_NONE not uptodate (brand new eb) lock_page if !trylock_page goto unlock_exit // not an error free_extent_buffer release_extent_buffer atomic_dec_and_test refs to 1 find_extent_buffer // found try_release_extent_buffer take refs_lock reads refs == 1; no io atomic_inc_not_zero refs to 2 mark_buffer_accessed check_buffer_tree_ref // not STALE, won't take refs_lock refs == 2; TREE_REF set // no action read_extent_buffer_pages // WAIT_NONE clear TREE_REF release_extent_buffer atomic_dec_and_test refs to 1 unlock_page still not uptodate (CPU1 read failed on trylock_page) locks pages set io_pages > 0 submit io return free_extent_buffer release_extent_buffer dec refs to 0 delete from radix tree btrfs_release_extent_buffer_pages BUG_ON(io_pages > 0)!!! We observe this at a very low rate in production and were also able to reproduce it in a test environment by introducing some spurious delays and by introducing probabilistic trylock_page failures. To fix it, we apply check_tree_ref at a point where it could not possibly be unset by a competing task: after io_pages has been incremented. All the codepaths that clear TREE_REF check for io, so they would not be able to clear it after this point until the io is done. Stack trace, for reference: [1417839.424739] ------------[ cut here ]------------ [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841! [1417839.447024] invalid opcode: 0000 [#1] SMP [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0 [1417839.517008] Code: ed e9 ... [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202 [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028 [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0 [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238 [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000 [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90 [1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000 [1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0 [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1417839.731320] Call Trace: [1417839.737103] release_extent_buffer+0x39/0x90 [1417839.746913] read_block_for_search.isra.38+0x2a3/0x370 [1417839.758645] btrfs_search_slot+0x260/0x9b0 [1417839.768054] btrfs_lookup_file_extent+0x4a/0x70 [1417839.778427] btrfs_get_extent+0x15f/0x830 [1417839.787665] ? submit_extent_page+0xc4/0x1c0 [1417839.797474] ? __do_readpage+0x299/0x7a0 [1417839.806515] __do_readpage+0x33b/0x7a0 [1417839.815171] ? btrfs_releasepage+0x70/0x70 [1417839.824597] extent_readpages+0x28f/0x400 [1417839.833836] read_pages+0x6a/0x1c0 [1417839.841729] ? startup_64+0x2/0x30 [1417839.849624] __do_page_cache_readahead+0x13c/0x1a0 [1417839.860590] filemap_fault+0x6c7/0x990 [1417839.869252] ? xas_load+0x8/0x80 [1417839.876756] ? xas_find+0x150/0x190 [1417839.884839] ? filemap_map_pages+0x295/0x3b0 [1417839.894652] __do_fault+0x32/0x110 [1417839.902540] __handle_mm_fault+0xacd/0x1000 [1417839.912156] handle_mm_fault+0xaa/0x1c0 [1417839.921004] __do_page_fault+0x242/0x4b0 [1417839.930044] ? page_fault+0x8/0x30 [1417839.937933] page_fault+0x1e/0x30 [1417839.945631] RIP: 0033:0x33c4bae [1417839.952927] Code: Bad RIP value. [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206 [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000 [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002 [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8 [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79 [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana Signed-off-by: Boris Burkov Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++---------------- 1 file changed, 24 insertions(+), 16 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c59e07360083..95313bb7fe40 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -5086,25 +5086,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, static void check_buffer_tree_ref(struct extent_buffer *eb) { int refs; - /* the ref bit is tricky. We have to make sure it is set - * if we have the buffer dirty. Otherwise the - * code to free a buffer can end up dropping a dirty - * page + /* + * The TREE_REF bit is first set when the extent_buffer is added + * to the radix tree. It is also reset, if unset, when a new reference + * is created by find_extent_buffer. * - * Once the ref bit is set, it won't go away while the - * buffer is dirty or in writeback, and it also won't - * go away while we have the reference count on the - * eb bumped. + * It is only cleared in two cases: freeing the last non-tree + * reference to the extent_buffer when its STALE bit is set or + * calling releasepage when the tree reference is the only reference. * - * We can't just set the ref bit without bumping the - * ref on the eb because free_extent_buffer might - * see the ref bit and try to clear it. If this happens - * free_extent_buffer might end up dropping our original - * ref by mistake and freeing the page before we are able - * to add one more ref. + * In both cases, care is taken to ensure that the extent_buffer's + * pages are not under io. However, releasepage can be concurrently + * called with creating new references, which is prone to race + * conditions between the calls to check_buffer_tree_ref in those + * codepaths and clearing TREE_REF in try_release_extent_buffer. * - * So bump the ref count first, then set the bit. If someone - * beat us to it, drop the ref we added. + * The actual lifetime of the extent_buffer in the radix tree is + * adequately protected by the refcount, but the TREE_REF bit and + * its corresponding reference are not. To protect against this + * class of races, we call check_buffer_tree_ref from the codepaths + * which trigger io after they set eb->io_pages. Note that once io is + * initiated, TREE_REF can no longer be cleared, so that is the + * moment at which any such race is best fixed. */ refs = atomic_read(&eb->refs); if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) @@ -5555,6 +5558,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num) clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags); eb->read_mirror = 0; atomic_set(&eb->io_pages, num_reads); + /* + * It is possible for releasepage to clear the TREE_REF bit before we + * set io_pages. See check_buffer_tree_ref for a more detailed comment. + */ + check_buffer_tree_ref(eb); for (i = 0; i < num_pages; i++) { page = eb->pages[i]; -- cgit From 5909ca110b29aa16b23b52b8de8d3bb1035fd738 Mon Sep 17 00:00:00 2001 From: Robbie Ko Date: Mon, 20 Jul 2020 09:42:09 +0800 Subject: btrfs: fix page leaks after failure to lock page for delalloc When locking pages for delalloc, we check if it's dirty and mapping still matches. If it does not match, we need to return -EAGAIN and release all pages. Only the current page was put though, iterate over all the remaining pages too. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana Reviewed-by: Nikolay Borisov Signed-off-by: Robbie Ko Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 95313bb7fe40..72c836b54efc 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1999,7 +1999,8 @@ static int __process_pages_contig(struct address_space *mapping, if (!PageDirty(pages[i]) || pages[i]->mapping != mapping) { unlock_page(pages[i]); - put_page(pages[i]); + for (; i < ret; i++) + put_page(pages[i]); err = -EAGAIN; goto out; } -- cgit From ad7ff17b65a0567b826c88009d3ea080431816c3 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:06 +0300 Subject: btrfs: make extent_clear_unlock_delalloc take btrfs_inode It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode interface will allow seamless conversion of its callers. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 60278e52c37a..1b18960c9c5d 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2018,15 +2018,14 @@ out: return err; } -void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end, +void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end, struct page *locked_page, unsigned clear_bits, unsigned long page_ops) { - clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, clear_bits, 1, 0, - NULL); + clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL); - __process_pages_contig(inode->i_mapping, locked_page, + __process_pages_contig(inode->vfs_inode.i_mapping, locked_page, start >> PAGE_SHIFT, end >> PAGE_SHIFT, page_ops, NULL); } -- cgit From bab16e21e8bbd644067289cfa328f8a67f3e333d Mon Sep 17 00:00:00 2001 From: David Sterba Date: Tue, 23 Jun 2020 20:56:12 +0200 Subject: btrfs: don't use UAPI types for fiemap callback The fiemap callback is not part of UAPI interface and the prototypes don't have the __u64 types either. Reviewed-by: Nikolay Borisov Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 1b18960c9c5d..e3078ed2261b 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4669,7 +4669,7 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo, } int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, - __u64 start, __u64 len) + u64 start, u64 len) { int ret = 0; u64 off = start; -- cgit From 98456b9c46c140275b0fe81970ddfad250c68ed4 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:29 +0300 Subject: btrfs: make btrfs_run_delalloc_range take btrfs_inode All children now take btrfs_inode so convert it to taking it as a parameter as well. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e3078ed2261b..fde2fcaeefaa 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3439,7 +3439,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, delalloc_start = delalloc_end + 1; continue; } - ret = btrfs_run_delalloc_range(inode, page, delalloc_start, + ret = btrfs_run_delalloc_range(BTRFS_I(inode), page, delalloc_start, delalloc_end, &page_started, nr_written, wbc); if (ret) { SetPageError(page); -- cgit From d4580fe25dd397b6f15bfc3eb628d28944ffd26b Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Wed, 3 Jun 2020 08:55:33 +0300 Subject: btrfs: make __extent_writepage_io take btrfs_inode It has only a single use for a generic vfs inode vs 3 for btrfs_inode. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index fde2fcaeefaa..cb9969ffafbd 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3496,7 +3496,7 @@ done: * 0 if all went well (page still locked) * < 0 if there were errors (page still locked) */ -static noinline_for_stack int __extent_writepage_io(struct inode *inode, +static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, struct page *page, struct writeback_control *wbc, struct extent_page_data *epd, @@ -3504,7 +3504,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode, unsigned long nr_written, int *nr_ret) { - struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; + struct extent_io_tree *tree = &inode->io_tree; u64 start = page_offset(page); u64 page_end = start + PAGE_SIZE - 1; u64 end; @@ -3536,7 +3536,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode, update_nr_written(wbc, nr_written + 1); end = page_end; - blocksize = inode->i_sb->s_blocksize; + blocksize = inode->vfs_inode.i_sb->s_blocksize; while (cur <= end) { u64 em_end; @@ -3547,8 +3547,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode, page_end, 1); break; } - em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, cur, - end - cur + 1); + em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1); if (IS_ERR_OR_NULL(em)) { SetPageError(page); ret = PTR_ERR_OR_ZERO(em); @@ -3585,7 +3584,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode, btrfs_set_range_writeback(tree, cur, cur + iosize - 1); if (!PageWriteback(page)) { - btrfs_err(BTRFS_I(inode)->root->fs_info, + btrfs_err(inode->root->fs_info, "page %lu not writeback, cur %llu end %llu", page->index, cur, end); } @@ -3665,8 +3664,8 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, goto done; } - ret = __extent_writepage_io(inode, page, wbc, epd, - i_size, nr_written, &nr); + ret = __extent_writepage_io(BTRFS_I(inode), page, wbc, epd, i_size, + nr_written, &nr); if (ret == 1) return 0; -- cgit From cd4c0bf94292b60ab6b0f5dfa5e1297572800f1f Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Fri, 5 Jun 2020 10:42:10 +0300 Subject: btrfs: make writepage_delalloc take btrfs_inode Only find_lock_delalloc_range uses vfs_inode so let's take the btrfs_inode as a parameter. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index cb9969ffafbd..abd36d92c077 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3419,7 +3419,7 @@ static void update_nr_written(struct writeback_control *wbc, * This returns 0 if all went well (page still locked) * This returns < 0 if there were errors (page still locked) */ -static noinline_for_stack int writepage_delalloc(struct inode *inode, +static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, struct page *page, struct writeback_control *wbc, u64 delalloc_start, unsigned long *nr_written) { @@ -3432,14 +3432,14 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, while (delalloc_end < page_end) { - found = find_lock_delalloc_range(inode, page, + found = find_lock_delalloc_range(&inode->vfs_inode, page, &delalloc_start, &delalloc_end); if (!found) { delalloc_start = delalloc_end + 1; continue; } - ret = btrfs_run_delalloc_range(BTRFS_I(inode), page, delalloc_start, + ret = btrfs_run_delalloc_range(inode, page, delalloc_start, delalloc_end, &page_started, nr_written, wbc); if (ret) { SetPageError(page); @@ -3657,7 +3657,8 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, set_page_extent_mapped(page); if (!epd->extent_locked) { - ret = writepage_delalloc(inode, page, wbc, start, &nr_written); + ret = writepage_delalloc(BTRFS_I(inode), page, wbc, start, + &nr_written); if (ret == 1) return 0; if (ret) -- cgit From 2279a270534c678e78f961fed8e09c9c69611ceb Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Thu, 2 Jul 2020 15:23:28 +0300 Subject: btrfs: make get_state_failrec return failrec directly Only failure that get_state_failrec can get is if there is no failure for the given address. There is no reason why the function should return a status code and use a separate parameter for returning the actual failure rec (if one is found). Simplify it by making the return type a pointer and return ERR_PTR value in case of errors. Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index abd36d92c077..59ec53464746 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2122,12 +2122,11 @@ out: return ret; } -int get_state_failrec(struct extent_io_tree *tree, u64 start, - struct io_failure_record **failrec) +struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start) { struct rb_node *node; struct extent_state *state; - int ret = 0; + struct io_failure_record *failrec; spin_lock(&tree->lock); /* @@ -2136,18 +2135,19 @@ int get_state_failrec(struct extent_io_tree *tree, u64 start, */ node = tree_search(tree, start); if (!node) { - ret = -ENOENT; + failrec = ERR_PTR(-ENOENT); goto out; } state = rb_entry(node, struct extent_state, rb_node); if (state->start != start) { - ret = -ENOENT; + failrec = ERR_PTR(-ENOENT); goto out; } - *failrec = state->failrec; + + failrec = state->failrec; out: spin_unlock(&tree->lock); - return ret; + return failrec; } /* @@ -2377,8 +2377,8 @@ int clean_io_failure(struct btrfs_fs_info *fs_info, if (!ret) return 0; - ret = get_state_failrec(failure_tree, start, &failrec); - if (ret) + failrec = get_state_failrec(failure_tree, start); + if (IS_ERR(failrec)) return 0; BUG_ON(!failrec->this_mirror); @@ -2462,8 +2462,8 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, int ret; u64 logical; - ret = get_state_failrec(failure_tree, start, &failrec); - if (ret) { + failrec = get_state_failrec(failure_tree, start); + if (IS_ERR(failrec)) { failrec = kzalloc(sizeof(*failrec), GFP_NOFS); if (!failrec) return -ENOMEM; -- cgit From 3526302f26162ae738926d44eb39e2e0780b11bc Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Thu, 2 Jul 2020 15:23:29 +0300 Subject: btrfs: streamline btrfs_get_io_failure_record logic Make the function directly return a pointer to a failure record and adjust callers to handle it. Also refactor the logic inside so that the case which allocates the failure record for the first time is not handled in an 'if' arm, saving us a level of indentation. Finally make the function static as it's not used outside of extent_io.c . Reviewed-by: Josef Bacik Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 129 +++++++++++++++++++++++++-------------------------- 1 file changed, 63 insertions(+), 66 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 59ec53464746..e37950a576fc 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2450,8 +2450,8 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end) spin_unlock(&failure_tree->lock); } -int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, - struct io_failure_record **failrec_ret) +static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode, + u64 start, u64 end) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct io_failure_record *failrec; @@ -2463,64 +2463,7 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, u64 logical; failrec = get_state_failrec(failure_tree, start); - if (IS_ERR(failrec)) { - failrec = kzalloc(sizeof(*failrec), GFP_NOFS); - if (!failrec) - return -ENOMEM; - - failrec->start = start; - failrec->len = end - start + 1; - failrec->this_mirror = 0; - failrec->bio_flags = 0; - failrec->in_validation = 0; - - read_lock(&em_tree->lock); - em = lookup_extent_mapping(em_tree, start, failrec->len); - if (!em) { - read_unlock(&em_tree->lock); - kfree(failrec); - return -EIO; - } - - if (em->start > start || em->start + em->len <= start) { - free_extent_map(em); - em = NULL; - } - read_unlock(&em_tree->lock); - if (!em) { - kfree(failrec); - return -EIO; - } - - logical = start - em->start; - logical = em->block_start + logical; - if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { - logical = em->block_start; - failrec->bio_flags = EXTENT_BIO_COMPRESSED; - extent_set_compress_type(&failrec->bio_flags, - em->compress_type); - } - - btrfs_debug(fs_info, - "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu", - logical, start, failrec->len); - - failrec->logical = logical; - free_extent_map(em); - - /* set the bits in the private failure tree */ - ret = set_extent_bits(failure_tree, start, end, - EXTENT_LOCKED | EXTENT_DIRTY); - if (ret >= 0) - ret = set_state_failrec(failure_tree, start, failrec); - /* set the bits in the inode's tree */ - if (ret >= 0) - ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED); - if (ret < 0) { - kfree(failrec); - return ret; - } - } else { + if (!IS_ERR(failrec)) { btrfs_debug(fs_info, "Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d", failrec->logical, failrec->start, failrec->len, @@ -2530,11 +2473,66 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, * (e.g. with a list for failed_mirror) to make * clean_io_failure() clean all those errors at once. */ + + return failrec; } - *failrec_ret = failrec; + failrec = kzalloc(sizeof(*failrec), GFP_NOFS); + if (!failrec) + return ERR_PTR(-ENOMEM); - return 0; + failrec->start = start; + failrec->len = end - start + 1; + failrec->this_mirror = 0; + failrec->bio_flags = 0; + failrec->in_validation = 0; + + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, start, failrec->len); + if (!em) { + read_unlock(&em_tree->lock); + kfree(failrec); + return ERR_PTR(-EIO); + } + + if (em->start > start || em->start + em->len <= start) { + free_extent_map(em); + em = NULL; + } + read_unlock(&em_tree->lock); + if (!em) { + kfree(failrec); + return ERR_PTR(-EIO); + } + + logical = start - em->start; + logical = em->block_start + logical; + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { + logical = em->block_start; + failrec->bio_flags = EXTENT_BIO_COMPRESSED; + extent_set_compress_type(&failrec->bio_flags, em->compress_type); + } + + btrfs_debug(fs_info, + "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu", + logical, start, failrec->len); + + failrec->logical = logical; + free_extent_map(em); + + /* Set the bits in the private failure tree */ + ret = set_extent_bits(failure_tree, start, end, + EXTENT_LOCKED | EXTENT_DIRTY); + if (ret >= 0) { + ret = set_state_failrec(failure_tree, start, failrec); + /* Set the bits in the inode's tree */ + ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED); + } else if (ret < 0) { + kfree(failrec); + return ERR_PTR(ret); + } + + return failrec; } static bool btrfs_check_repairable(struct inode *inode, bool needs_validation, @@ -2659,16 +2657,15 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode, struct bio *repair_bio; struct btrfs_io_bio *repair_io_bio; blk_status_t status; - int ret; btrfs_debug(fs_info, "repair read error: read error at %llu", start); BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE); - ret = btrfs_get_io_failure_record(inode, start, end, &failrec); - if (ret) - return errno_to_blk_status(ret); + failrec = btrfs_get_io_failure_record(inode, start, end); + if (IS_ERR(failrec)) + return errno_to_blk_status(PTR_ERR(failrec)); need_validation = btrfs_io_needs_validation(inode, failed_bio); -- cgit From b69d1ee923ec59533672ff4953ab45c57576bce0 Mon Sep 17 00:00:00 2001 From: Nikolay Borisov Date: Thu, 16 Jul 2020 18:17:19 +0300 Subject: btrfs: remove done label in writepage_delalloc Since there is not common cleanup run after the label it makes it somewhat redundant. Reviewed-by: Johannes Thumshirn Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e37950a576fc..73c9c59cd535 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3446,8 +3446,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, * started, so we don't want to return > 0 unless * things are going well. */ - ret = ret < 0 ? ret : -EIO; - goto done; + return ret < 0 ? ret : -EIO; } /* * delalloc_end is already one less than the total length, so @@ -3479,10 +3478,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, return 1; } - ret = 0; - -done: - return ret; + return 0; } /* -- cgit From fbabd4a36faaf74c83142d0b3d950c11ec14fda1 Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Tue, 21 Jul 2020 10:38:37 -0400 Subject: btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases Eric reported seeing this message while running generic/475 BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted Full stack trace: BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-0): forced readonly BTRFS warning (device dm-0): Skipping commit of aborted transaction. ------------[ cut here ]------------ BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure BTRFS: Transaction aborted (error -117) BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10 WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs] BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10 CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs] RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001 RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7 RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000 R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70 FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0 Call Trace: ? finish_wait+0x90/0x90 ? __mutex_unlock_slowpath+0x45/0x2a0 ? lock_acquire+0xa3/0x440 ? lockref_put_or_lock+0x9/0x30 ? dput+0x20/0x4a0 ? dput+0x20/0x4a0 ? do_raw_spin_unlock+0x4b/0xc0 ? _raw_spin_unlock+0x1f/0x30 btrfs_sync_file+0x335/0x490 [btrfs] do_fsync+0x38/0x70 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f6a6ef1b6e3 Code: Bad RIP value. RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3 RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [] copy_process+0x67b/0x1b00 softirqs last enabled at (0): [] copy_process+0x67b/0x1b00 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace af146e0e38433456 ]--- BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted This ret came from btrfs_write_marked_extents(). If we get an aborted transaction via EIO before, we'll see it in btree_write_cache_pages() and return EUCLEAN, which gets printed as "Filesystem corrupted". Except we shouldn't be returning EUCLEAN here, we need to be returning EROFS because EUCLEAN is reserved for actual corruption, not IO errors. We are inconsistent about our handling of BTRFS_FS_STATE_ERROR elsewhere, but we want to use EROFS for this particular case. The original transaction abort has the real error code for why we ended up with an aborted transaction, all subsequent actions just need to return EROFS because they may not have a trans handle and have no idea about the original cause of the abort. After patch "btrfs: don't WARN if we abort a transaction with EROFS" the stacktrace will not be dumped either. Reported-by: Eric Sandeen CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik Reviewed-by: David Sterba [ add full test stacktrace ] Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 73c9c59cd535..3fbc37692592 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4119,7 +4119,7 @@ retry: if (!test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state)) { ret = flush_write_bio(&epd); } else { - ret = -EUCLEAN; + ret = -EROFS; end_write_bio(&epd, ret); } return ret; -- cgit From 3d6448e631591756da36efb3ea6355ff6f383c3a Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 22 Jul 2020 12:28:37 +0100 Subject: btrfs: fix race between page release and a fast fsync When releasing an extent map, done through the page release callback, we can race with an ongoing fast fsync and cause the fsync to miss a new extent and not log it. The steps for this to happen are the following: 1) A page is dirtied for some inode I; 2) Writeback for that page is triggered by a path other than fsync, for example by the system due to memory pressure; 3) When the ordered extent for the extent (a single 4K page) finishes, we unpin the corresponding extent map and set its generation to N, the current transaction's generation; 4) The btrfs_releasepage() callback is invoked by the system due to memory pressure for that no longer dirty page of inode I; 5) At the same time, some task calls fsync on inode I, joins transaction N, and at btrfs_log_inode() it sees that the inode does not have the full sync flag set, so we proceed with a fast fsync. But before we get into btrfs_log_changed_extents() and lock the inode's extent map tree: 6) Through btrfs_releasepage() we end up at try_release_extent_mapping() and we remove the extent map for the new 4Kb extent, because it is neither pinned anymore nor locked. By calling remove_extent_mapping(), we remove the extent map from the list of modified extents, since the extent map does not have the logging flag set. We unlock the inode's extent map tree; 7) The task doing the fast fsync now enters btrfs_log_changed_extents(), locks the inode's extent map tree and iterates its list of modified extents, which no longer has the 4Kb extent in it, so it does not log the extent; 8) The fsync finishes; 9) Before transaction N is committed, a power failure happens. After replaying the log, the 4K extent of inode I will be missing, since it was not logged due to the race with try_release_extent_mapping(). So fix this by teaching try_release_extent_mapping() to not remove an extent map if it's still in the list of modified extents. Fixes: ff44c6e36dc9dc ("Btrfs: do not hold the write_lock on the extent tree while logging") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3fbc37692592..c049a33ed0df 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4494,15 +4494,25 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) free_extent_map(em); break; } - if (!test_range_bit(tree, em->start, - extent_map_end(em) - 1, - EXTENT_LOCKED, 0, NULL)) { + if (test_range_bit(tree, em->start, + extent_map_end(em) - 1, + EXTENT_LOCKED, 0, NULL)) + goto next; + /* + * If it's not in the list of modified extents, used + * by a fast fsync, we can remove it. If it's being + * logged we can safely remove it since fsync took an + * extra reference on the em. + */ + if (list_empty(&em->list) || + test_bit(EXTENT_FLAG_LOGGING, &em->flags)) { set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &btrfs_inode->runtime_flags); remove_extent_mapping(map, em); /* once for the rb tree */ free_extent_map(em); } +next: start = extent_map_end(em); write_unlock(&map->lock); -- cgit From fbc2bd7e7ab95704a96dade6a7d00ea99a68b015 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 22 Jul 2020 12:28:52 +0100 Subject: btrfs: release old extent maps during page release When removing an extent map at try_release_extent_mapping(), called through the page release callback (btrfs_releasepage()), we never release an extent map that is in the list of modified extents. This is to prevent races with a concurrent fsync using the fast path, which could lead to not logging an extent created in the current transaction. However we can safely remove an extent map created in a past transaction that is still in the list of modified extents (because no one fsynced yet the inode after that transaction got commited), because such extents are skipped during an fsync as it is pointless to log them. This change does that. Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 31 ++++++++++++++++++++++++------- 1 file changed, 24 insertions(+), 7 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c049a33ed0df..d619babede0f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4481,6 +4481,9 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) page->mapping->host->i_size > SZ_16M) { u64 len; while (start <= end) { + struct btrfs_fs_info *fs_info; + u64 cur_gen; + len = end - start + 1; write_lock(&map->lock); em = lookup_extent_mapping(map, start, len); @@ -4505,13 +4508,27 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) * extra reference on the em. */ if (list_empty(&em->list) || - test_bit(EXTENT_FLAG_LOGGING, &em->flags)) { - set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, - &btrfs_inode->runtime_flags); - remove_extent_mapping(map, em); - /* once for the rb tree */ - free_extent_map(em); - } + test_bit(EXTENT_FLAG_LOGGING, &em->flags)) + goto remove_em; + /* + * If it's in the list of modified extents, remove it + * only if its generation is older then the current one, + * in which case we don't need it for a fast fsync. + * Otherwise don't remove it, we could be racing with an + * ongoing fast fsync that could miss the new extent. + */ + fs_info = btrfs_inode->root->fs_info; + spin_lock(&fs_info->trans_lock); + cur_gen = fs_info->generation; + spin_unlock(&fs_info->trans_lock); + if (em->generation >= cur_gen) + goto next; +remove_em: + set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, + &btrfs_inode->runtime_flags); + remove_extent_mapping(map, em); + /* once for the rb tree */ + free_extent_map(em); next: start = extent_map_end(em); write_unlock(&map->lock); -- cgit From 5e548b32018d96c377fda4bdac2bf511a448ca67 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 22 Jul 2020 12:29:01 +0100 Subject: btrfs: do not set the full sync flag on the inode during page release When removing an extent map at try_release_extent_mapping(), called through the page release callback (btrfs_releasepage()), we always set the full sync flag on the inode, which forces the next fsync to use a slower code path. This hurts performance for workloads that dirty an amount of data that exceeds or is very close to the system's RAM memory and do frequent fsync operations (like database servers can for example). In particular if there are concurrent fsyncs against different files, by falling back to a full fsync we do a lot more checksum lookups in the checksums btree, as we do it for all the extents created in the current transaction, instead of only the new ones since the last fsync. These checksums lookups not only take some time but, more importantly, they also cause contention on the checksums btree locks due to the concurrency with checksum insertions in the btree by ordered extents from other inodes. We actually don't need to set the full sync flag on the inode, because we only remove extent maps that are in the list of modified extents if they were created in a past transaction, in which case an fsync skips them as it's pointless to log them. So stop setting the full fsync flag on the inode whenever we remove an extent map. This patch is part of a patchset that consists of 3 patches, which have the following subjects: 1/3 btrfs: fix race between page release and a fast fsync 2/3 btrfs: release old extent maps during page release 3/3 btrfs: do not set the full sync flag on the inode during page release Performance tests were ran against a branch (misc-next) containing the whole patchset. The test exercises a workload where there are multiple processes writing to files and fsyncing them (each writing and fsyncing its own file), and in total the amount of data dirtied ranges from 2x to 4x the system's RAM memory (16GiB), so that the page release callback is invoked frequently. The following script, using fio, was used to perform the tests: $ cat test-fsync.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 3 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 cat < /tmp/fio-job.ini [writers] rw=write fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=64k ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS thread EOF echo "Using config:" echo cat /tmp/fio-job.ini echo mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were performed for different numbers of jobs, file sizes and fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has 12 cores, with cpu governance set to performance mode on all cores), 16GiB of ram (the host has 64GiB) and using a NVMe device directly (without an intermediary filesystem in the host). While running the tests, the host was not used for anything else, to avoid disturbing the tests. The obtained results were the following, and the last line printed by fio is pasted (includes aggregated throughput and test run time). ***************************************************** **** 1 job, 32GiB file, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec After patchset: WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec (+0.7% throughput, -0.8% run time) ***************************************************** **** 2 jobs, 16GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec After patchset: WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec (+19.1% throughput, -16.1% runtime) ***************************************************** **** 4 jobs, 8GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec After patchset: WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec (+37.8% throughput, -27.5% runtime) ***************************************************** **** 8 jobs, 4GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec After patchset: WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec (+74.7% throughput, -43.0% run time) ***************************************************** **** 16 jobs, 2GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec After patchset: WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec (+146.3% throughput, -59.5% run time) ***************************************************** **** 32 jobs, 2GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec After patchset: WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec (+212.7% throughput, -68.0% run time) ***************************************************** **** 64 jobs, 1GiB files, fsync frequency 1 **** ***************************************************** Before patchset: WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec After patchset: WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec (+197.6% throughput, -66.4% run time) Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'fs/btrfs/extent_io.c') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index d619babede0f..617ea38e6fd7 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4524,8 +4524,14 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) if (em->generation >= cur_gen) goto next; remove_em: - set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, - &btrfs_inode->runtime_flags); + /* + * We only remove extent maps that are not in the list of + * modified extents or that are in the list but with a + * generation lower then the current generation, so there + * is no need to set the full fsync flag on the inode (it + * hurts the fsync performance for workloads with a data + * size that exceeds or is close to the system's memory). + */ remove_extent_mapping(map, em); /* once for the rb tree */ free_extent_map(em); -- cgit