From 410f954cb1d1c79ae485dd83a175f21954fd87cd Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 10 Sep 2019 15:26:49 +0100 Subject: Btrfs: fix assertion failure during fsync and use of stale transaction Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/tree-log.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6c8297bcfeb7..1bfd7e34f31e 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, BTRFS_I(inode), LOG_OTHER_INODE_ALL, 0, LLONG_MAX, ctx); - iput(inode); + btrfs_add_delayed_iput(inode); } } continue; @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_OTHER_INODE, 0, LLONG_MAX, ctx); if (ret) { - iput(inode); + btrfs_add_delayed_iput(inode); continue; } @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) { - iput(inode); + btrfs_add_delayed_iput(inode); continue; } @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, } path->slots[0]++; } - iput(inode); + btrfs_add_delayed_iput(inode); } return ret; @@ -5689,7 +5689,7 @@ process_leaf: } if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { - iput(di_inode); + btrfs_add_delayed_iput(di_inode); break; } @@ -5701,7 +5701,7 @@ process_leaf: if (!ret && btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) ret = 1; - iput(di_inode); + btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; if (ctx->log_new_dentries) { @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, if (!ret && ctx && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, root, BTRFS_I(dir_inode), ctx); - iput(dir_inode); + btrfs_add_delayed_iput(dir_inode); if (ret) goto out; } @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_INODE_EXISTS, 0, LLONG_MAX, ctx); - iput(inode); + btrfs_add_delayed_iput(inode); if (ret) return ret; -- cgit From 18dfa7117a3f379862dcd3f67cadd678013bb9dd Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 11 Sep 2019 17:42:00 +0100 Subject: Btrfs: fix unwritten extent buffers and hangs on future writeback attempts The lock_extent_buffer_io() returns 1 to the caller to tell it everything went fine and the callers needs to start writeback for the extent buffer (submit a bio, etc), 0 to tell the caller everything went fine but it does not need to start writeback for the extent buffer, and a negative value if some error happened. When it's about to return 1 it tries to lock all pages, and if a try lock on a page fails, and we didn't flush any existing bio in our "epd", it calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or an error. The page might have been locked elsewhere, not with the goal of starting writeback of the extent buffer, and even by some code other than btrfs, like page migration for example, so it does not mean the writeback of the extent buffer was already started by some other task, so returning a 0 tells the caller (btree_write_cache_pages()) to not start writeback for the extent buffer. Note that epd might currently have either no bio, so flush_write_bio() returns 0 (success) or it might have a bio for another extent buffer with a lower index (logical address). Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the extent buffer and writeback is never started for the extent buffer, future attempts to writeback the extent buffer will hang forever waiting on that bit to be cleared, since it can only be cleared after writeback completes. Such hang is reported with a trace like the following: [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds. [49887.347059] Not tainted 5.2.13-gentoo #2 [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49887.347062] btrfs-transacti D 0 1752 2 0x80004000 [49887.347064] Call Trace: [49887.347069] ? __schedule+0x265/0x830 [49887.347071] ? bit_wait+0x50/0x50 [49887.347072] ? bit_wait+0x50/0x50 [49887.347074] schedule+0x24/0x90 [49887.347075] io_schedule+0x3c/0x60 [49887.347077] bit_wait_io+0x8/0x50 [49887.347079] __wait_on_bit+0x6c/0x80 [49887.347081] ? __lock_release.isra.29+0x155/0x2d0 [49887.347083] out_of_line_wait_on_bit+0x7b/0x80 [49887.347084] ? var_wake_function+0x20/0x20 [49887.347087] lock_extent_buffer_for_io+0x28c/0x390 [49887.347089] btree_write_cache_pages+0x18e/0x340 [49887.347091] do_writepages+0x29/0xb0 [49887.347093] ? kmem_cache_free+0x132/0x160 [49887.347095] ? convert_extent_bit+0x544/0x680 [49887.347097] filemap_fdatawrite_range+0x70/0x90 [49887.347099] btrfs_write_marked_extents+0x53/0x120 [49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0 [49887.347102] btrfs_commit_transaction+0x6bb/0x990 [49887.347103] ? start_transaction+0x33e/0x500 [49887.347105] transaction_kthread+0x139/0x15c So fix this by not overwriting the return value (ret) with the result from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK bit in case flush_write_bio() returns an error, otherwise it will hang any future attempts to writeback the extent buffer, and undo all work done before (set back EXTENT_BUFFER_DIRTY, etc). This is a regression introduced in the 5.2 kernel. Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()") Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") Reported-by: Zdenek Sojka Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u Reported-by: Stefan Priebe - Profihost AG Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t Reported-by: Drazen Kacar Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377 Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 35 ++++++++++++++++++++++++++--------- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index aea990473392..9d1f42200af5 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3628,6 +3628,13 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb) TASK_UNINTERRUPTIBLE); } +static void end_extent_buffer_writeback(struct extent_buffer *eb) +{ + clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); + smp_mb__after_atomic(); + wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK); +} + /* * Lock eb pages and flush the bio if we can't the locks * @@ -3699,8 +3706,11 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb if (!trylock_page(p)) { if (!flush) { - ret = flush_write_bio(epd); - if (ret < 0) { + int err; + + err = flush_write_bio(epd); + if (err < 0) { + ret = err; failed_page_nr = i; goto err_unlock; } @@ -3715,16 +3725,23 @@ err_unlock: /* Unlock already locked pages */ for (i = 0; i < failed_page_nr; i++) unlock_page(eb->pages[i]); + /* + * Clear EXTENT_BUFFER_WRITEBACK and wake up anyone waiting on it. + * Also set back EXTENT_BUFFER_DIRTY so future attempts to this eb can + * be made and undo everything done before. + */ + btrfs_tree_lock(eb); + spin_lock(&eb->refs_lock); + set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); + end_extent_buffer_writeback(eb); + spin_unlock(&eb->refs_lock); + percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, eb->len, + fs_info->dirty_metadata_batch); + btrfs_clear_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN); + btrfs_tree_unlock(eb); return ret; } -static void end_extent_buffer_writeback(struct extent_buffer *eb) -{ - clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); - smp_mb__after_atomic(); - wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK); -} - static void set_btree_ioerr(struct page *page) { struct extent_buffer *eb = (struct extent_buffer *)page->private; -- cgit