summaryrefslogtreecommitdiff
path: root/fs/btrfs/transaction.c
AgeCommit message (Collapse)Author
2020-02-19btrfs: destroy qgroup extent records on transaction abortJeff Mahoney
We clean up the delayed references when we abort a transaction but we leave the pending qgroup extent records behind, leaking memory. This patch destroys the extent records when we destroy the delayed refs and makes sure ensure they're gone before releasing the transaction. Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ Rebased to latest upstream, remove to_qgroup() helper, use rbtree_postorder_for_each_entry_safe() wrapper ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23btrfs: set trans->drity in btrfs_commit_transactionJosef Bacik
If we abort a transaction we have the following sequence if (!trans->dirty && list_empty(&trans->new_bgs)) return; WRITE_ONCE(trans->transaction->aborted, err); The idea being if we didn't modify anything with our trans handle then we don't really need to abort the whole transaction, maybe the other trans handles are fine and we can carry on. However in the case of create_snapshot we add a pending_snapshot object to our transaction and then commit the transaction. We don't actually modify anything. sync() behaves the same way, attach to an existing transaction and commit it. This means that if we have an IO error in the right places we could abort the committing transaction with our trans->dirty being not set and thus not set transaction->aborted. This is a problem because in the create_snapshot() case we depend on pending->error being set to something, or btrfs_commit_transaction returning an error. If we are not the trans handle that gets to commit the transaction, and we're waiting on the commit to happen we get our return value from cur_trans->aborted. If this was not set to anything because sync() hit an error in the transaction commit before it could modify anything then cur_trans->aborted would be 0. Thus we'd return 0 from btrfs_commit_transaction() in create_snapshot. This is a problem because we then try to do things with pending_snapshot->snap, which will be NULL because we didn't create the snapshot, and then we'll get a NULL pointer dereference like the following "BUG: kernel NULL pointer dereference, address: 00000000000001f0" RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330 Call Trace: ? btrfs_mksubvol.isra.31+0x3f2/0x510 btrfs_mksubvol.isra.31+0x4bc/0x510 ? __sb_start_write+0xfa/0x200 ? mnt_want_write_file+0x24/0x50 btrfs_ioctl_snap_create_transid+0x16c/0x1a0 btrfs_ioctl_snap_create_v2+0x11e/0x1a0 btrfs_ioctl+0x1534/0x2c10 ? free_debug_processing+0x262/0x2a3 do_vfs_ioctl+0xa6/0x6b0 ? do_sys_open+0x188/0x220 ? syscall_trace_enter+0x1f8/0x330 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4a/0x1b0 In order to fix this we need to make sure anybody who calls commit_transaction has trans->dirty set so that they properly set the trans->transaction->aborted value properly so any waiters know bad things happened. This was found while I was running generic/475 with my modified fsstress, it reproduced within a few runs. I ran with this patch all night and didn't see the problem again. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23btrfs: drop log root for dropped rootsJosef Bacik
If we fsync on a subvolume and create a log root for that volume, and then later delete that subvolume we'll never clean up its log root. Fix this by making switch_commit_roots free the log for any dropped roots we encounter. The extra churn is because we need a btrfs_trans_handle, not the btrfs_transaction. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: rename btrfs_block_group_cacheDavid Sterba
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Rename btrfs_join_transaction_nolockNikolay Borisov
This function is used only during the final phase of freespace cache writeout. This is necessary since using the plain btrfs_join_transaction api is deadlock prone. The deadlock looks like: T1: btrfs_commit_transaction commit_cowonly_roots btrfs_write_dirty_block_groups btrfs_wait_cache_io __btrfs_wait_cache_io btrfs_wait_ordered_range <-- Triggers ordered IO for freespace inode and blocks transaction commit until freespace cache writeout T2: <-- after T1 has triggered the writeout finish_ordered_fn btrfs_finish_ordered_io btrfs_join_transaction <--- this would block waiting for current transaction to commit, but since trans commit is waiting for this writeout to finish The special purpose functions prevents it by simply skipping the "wait for writeout" since it's guaranteed the transaction won't proceed until we are done. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKEDQu Wenruo
The state was introduced in commit 4a9d8bdee368 ("Btrfs: make the state of the transaction more readable"), then in commit 302167c50b32 ("btrfs: don't end the transaction for delayed refs in throttle") the state is completely removed. So we can just clean up the state since it's only compared but never set. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: transaction: describe transaction states and transitionsQu Wenruo
Add an overview of the basic btrfs transaction transitions, including the following states: - No transaction states - Transaction N [[TRANS_STATE_RUNNING]] - Transaction N [[TRANS_STATE_COMMIT_START]] - Transaction N [[TRANS_STATE_COMMIT_DOING]] - Transaction N [[TRANS_STATE_UNBLOCKED]] - Transaction N [[TRANS_STATE_COMPLETED]] For each state, the comment will include: - Basic explaination about current state - How to go next stage - What will happen if we call various start_transaction() functions - Relationship to transaction N+1 This doesn't provide tech details, but serves as a cheat sheet for reader to get into the code a little easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: make btrfs_wait_extents() staticFilipe Manana
It's not used ouside of transaction.c Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Add assert to catch nested transaction commitNikolay Borisov
A recent patch to btrfs showed that there was at least 1 case where a nested transaction was committed. Nested transaction in this case means a code which has a transaction handle calls some function which in turn obtains a copy of the same transaction handle. In such cases the correct thing to do is for the lower callee to call btrfs_end_transaction which contains appropriate checks so as to not commit the transaction which will result in stale trans handler for the caller. To catch such cases add an assert in btrfs_commit_transaction ensuring btrfs_trans_handle::use_count is always 1. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move cond_wake_up functions out of ctreeDavid Sterba
The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: rename the btrfs_calc_*_metadata_size helpersJosef Bacik
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However truncate _and_ changing will never split nodes, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely for inserting items, so rename this to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move basic block_group definitions to their own headerJosef Bacik
This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-30Btrfs: fix deadlock between fiemap and transaction commitsFilipe Manana
The fiemap handler locks a file range that can have unflushed delalloc, and after locking the range, it tries to attach to a running transaction. If the running transaction started its commit, that is, it is in state TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the flushoncommit option or the transaction is creating a snapshot for the subvolume that contains the file that fiemap is operating on, we end up deadlocking. This happens because fiemap is blocked on the transaction, waiting for it to complete, and the transaction is waiting for the flushed dealloc to complete, which requires locking the file range that the fiemap task already locked. The following stack traces serve as an example of when this deadlock happens: (...) [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [404571.515956] Call Trace: [404571.516360] ? __schedule+0x3ae/0x7b0 [404571.516730] schedule+0x3a/0xb0 [404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs] [404571.517465] ? remove_wait_queue+0x60/0x60 [404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs] [404571.518202] normal_work_helper+0xea/0x530 [btrfs] [404571.518566] process_one_work+0x21e/0x5c0 [404571.518990] worker_thread+0x4f/0x3b0 [404571.519413] ? process_one_work+0x5c0/0x5c0 [404571.519829] kthread+0x103/0x140 [404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70 [404571.520565] ret_from_fork+0x3a/0x50 [404571.520915] kworker/u8:6 D 0 31651 2 0x80004000 [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs] (...) [404571.537000] fsstress D 0 13117 13115 0x00004000 [404571.537263] Call Trace: [404571.537524] ? __schedule+0x3ae/0x7b0 [404571.537788] schedule+0x3a/0xb0 [404571.538066] wait_current_trans+0xc8/0x100 [btrfs] [404571.538349] ? remove_wait_queue+0x60/0x60 [404571.538680] start_transaction+0x33c/0x500 [btrfs] [404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs] [404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs] [404571.539866] extent_fiemap+0x2ce/0x650 [btrfs] [404571.540170] do_vfs_ioctl+0x526/0x6f0 [404571.540436] ksys_ioctl+0x70/0x80 [404571.540734] __x64_sys_ioctl+0x16/0x20 [404571.540997] do_syscall_64+0x60/0x1d0 [404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [404571.543729] btrfs D 0 14210 14208 0x00004000 [404571.544023] Call Trace: [404571.544275] ? __schedule+0x3ae/0x7b0 [404571.544526] ? wait_for_completion+0x112/0x1a0 [404571.544795] schedule+0x3a/0xb0 [404571.545064] schedule_timeout+0x1ff/0x390 [404571.545351] ? lock_acquire+0xa6/0x190 [404571.545638] ? wait_for_completion+0x49/0x1a0 [404571.545890] ? wait_for_completion+0x112/0x1a0 [404571.546228] wait_for_completion+0x131/0x1a0 [404571.546503] ? wake_up_q+0x70/0x70 [404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs] [404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs] [404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs] [404571.547703] ? remove_wait_queue+0x60/0x60 [404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs] [404571.548226] ? __sb_start_write+0xd4/0x1c0 [404571.548512] ? mnt_want_write_file+0x24/0x50 [404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs] [404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs] [404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs] [404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0 [404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0 [404571.550064] ? __handle_mm_fault+0xe3e/0x11f0 [404571.550306] ? do_raw_spin_unlock+0x49/0xc0 [404571.550608] ? _raw_spin_unlock+0x24/0x30 [404571.550976] ? __handle_mm_fault+0xedf/0x11f0 [404571.551319] ? do_vfs_ioctl+0xa2/0x6f0 [404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [404571.552087] do_vfs_ioctl+0xa2/0x6f0 [404571.552355] ksys_ioctl+0x70/0x80 [404571.552621] __x64_sys_ioctl+0x16/0x20 [404571.552864] do_syscall_64+0x60/0x1d0 [404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) If we were joining the transaction instead of attaching to it, we would not risk a deadlock because a join only blocks if the transaction is in a state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc flush performed by a transaction is done before it reaches that state, when it is in the state TRANS_STATE_COMMIT_START. However a transaction join is intended for use cases where we do modify the filesystem, and fiemap only needs to peek at delayed references from the current transaction in order to determine if extents are shared, and, besides that, when there is no current transaction or when it blocks to wait for a current committing transaction to complete, it creates a new transaction without reserving any space. Such unnecessary transactions, besides doing unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary rotation of the precious backup roots. So fix this by adding a new transaction join variant, named join_nostart, which behaves like the regular join, but it does not create a transaction when none currently exists or after waiting for a committing transaction to complete. Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-30Btrfs: fix race leading to fs corruption after transaction abortFilipe Manana
When one transaction is finishing its commit, it is possible for another transaction to start and enter its initial commit phase as well. If the first ends up getting aborted, we have a small time window where the second transaction commit does not notice that the previous transaction aborted and ends up committing, writing a superblock that points to btrees that reference extent buffers (nodes and leafs) that were not persisted to disk. The consequence is that after mounting the filesystem again, we will be unable to load some btree nodes/leafs, either because the content on disk is either garbage (or just zeroes) or corresponds to the old content of a previouly COWed or deleted node/leaf, resulting in the well known error messages "parent transid verify failed on ...". The following sequence diagram illustrates how this can happen. CPU 1 CPU 2 <at transaction N> btrfs_commit_transaction() (...) --> sets transaction state to TRANS_STATE_UNBLOCKED --> sets fs_info->running_transaction to NULL (...) btrfs_start_transaction() start_transaction() wait_current_trans() --> returns immediately because fs_info->running_transaction is NULL join_transaction() --> creates transaction N + 1 --> sets fs_info->running_transaction to transaction N + 1 --> adds transaction N + 1 to the fs_info->trans_list list --> returns transaction handle pointing to the new transaction N + 1 (...) btrfs_sync_file() btrfs_start_transaction() --> returns handle to transaction N + 1 (...) btrfs_write_and_wait_transaction() --> writeback of some extent buffer fails, returns an error btrfs_handle_fs_error() --> sets BTRFS_FS_STATE_ERROR in fs_info->fs_state --> jumps to label "scrub_continue" cleanup_transaction() btrfs_abort_transaction(N) --> sets BTRFS_FS_STATE_TRANS_ABORTED flag in fs_info->fs_state --> sets aborted field in the transaction and transaction handle structures, for transaction N only --> removes transaction from the list fs_info->trans_list btrfs_commit_transaction(N + 1) --> transaction N + 1 was not aborted, so it proceeds (...) --> sets the transaction's state to TRANS_STATE_COMMIT_START --> does not find the previous transaction (N) in the fs_info->trans_list, so it doesn't know that transaction was aborted, and the commit of transaction N + 1 proceeds (...) --> sets transaction N + 1 state to TRANS_STATE_UNBLOCKED btrfs_write_and_wait_transaction() --> succeeds writing all extent buffers created in the transaction N + 1 write_all_supers() --> succeeds --> we now have a superblock on disk that points to trees that refer to at least one extent buffer that was never persisted So fix this by updating the transaction commit path to check if the flag BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting the transaction to the TRANS_STATE_COMMIT_START we do not find any previous transaction in the fs_info->trans_list. If the flag is set, just fail the transaction commit with -EROFS, as we do in other places. The exact error code for the previous transaction abort was already logged and reported. Fixes: 49b25e0540904b ("btrfs: enhance transaction abort infrastructure") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate btrfs_trans_release_chunk_metadataJosef Bacik
Move this into transaction.c with the rest of the transaction related code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: remove no longer used member num_dirty_bgs from transactionFilipe Manana
The member num_dirty_bgs of struct btrfs_transaction is not used anymore, it is set and incremented but nothing reads its value anymore. Its last read use was removed by commit 64403612b73a94 ("btrfs: rework btrfs_check_space_for_delayed_refs"). So just remove that member. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from trans in btrfs_run_dev_replaceDavid Sterba
We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from trans in btrfs_run_dev_statsDavid Sterba
We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from trans in btrfs_write_dirty_block_groupsDavid Sterba
We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from trans in btrfs_setup_space_cacheDavid Sterba
We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: replace pending/pinned chunks lists with io treeJeff Mahoney
The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Rename and export clear_btree_io_treeNikolay Borisov
This function is going to be used to clear out the device extent allocation information. Give it a more generic name and export it. This is in preparation to replacing the pending/pinned chunk lists with an extent tree. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: combine device update operations during transaction commitNikolay Borisov
We currently overload the pending_chunks list to handle updating btrfs_device->commit_bytes used. We don't actually care about the extent mapping or even the device mapping for the chunk - we just need the device, and we can end up processing it multiple times. The fs_devices->resized_list does more or less the same thing, but with the disk size. They are called consecutively during commit and have more or less the same purpose. We can combine the two lists into a single list that attaches to the transaction and contains a list of devices that need updating. Since we always add the device to a list when we change bytes_used or disk_total_size, there's no harm in copying both values at once. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Introduce extent_io_tree::owner to distinguish different io_treesQu Wenruo
Btrfs has the following different extent_io_trees used: - fs_info::free_extents[2] - btrfs_inode::io_tree - for both normal inodes and the btree inode - btrfs_inode::io_failure_tree - btrfs_transaction::dirty_pages - btrfs_root::dirty_log_pages If we want to trace changes in those trees, it will be pretty hard to distinguish them. Instead of using hard-to-read pointer address, this patch will introduce a new member extent_io_tree::owner to track the owner. This modification needs all the callers of extent_io_tree_init() to accept a new parameter @owner. This patch provides the basis for later trace events. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Introduce fs_info to extent_io_treeQu Wenruo
This patch will add a new member fs_info to extent_io_tree. This provides the basis for later trace events to distinguish the output between different btrfs filesystems. While this increases the size of the structure, we want to know the source of the trace events and passing the fs_info as an argument to all contexts is not possible. The selftests are now allowed to set it to NULL as they don't use the tracepoints. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-13Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writesFilipe Manana
When we are mixing buffered writes with direct IO writes against the same file and snapshotting is happening concurrently, we can end up with a corrupt file content in the snapshot. Example: 1) Inode/file is empty. 2) Snapshotting starts. 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the inode to 256Kb, disk_i_size remains zero. This happens after the task doing the snapshot flushes all existing delalloc. 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and updates the inode item in the fs tree with a size of 1Mb (which is the value of disk_i_size). 4) The dealloc for the range [0, 256Kb[ did not start yet. 5) The transaction used in the DIO ordered extent completion, which updated the inode item, is committed by the snapshotting task. 6) Snapshot creation completes. 7) Dealloc for the range [0, 256Kb[ is flushed. After that when reading the file from the snapshot we always get zeroes for the range [0, 256Kb[, the file has a size of 1Mb and the data written by the direct IO write is found. From an application's point of view this is a corruption, since in the source subvolume it could never read a version of the file that included the data from the direct IO write without the data from the buffered write included as well. In the snapshot's tree, file extent items are missing for the range [0, 256Kb[. The issue, obviously, does not happen when using the -o flushoncommit mount option. Fix this by flushing delalloc for all the roots that are about to be snapshotted when committing a transaction. This guarantees total ordering when updating the disk_i_size of an inode since the flush for dealloc is done when a transaction is in the TRANS_STATE_COMMIT_START state and wait is done once no more external writers exist. This is similar to what we do when using the flushoncommit mount option, but we do it only if the transaction has snapshots to create and only for the roots of the subvolumes to be snapshotted. The bulk of the dealloc is flushed in the snapshot creation ioctl, so the flush work we do inside the transaction is minimized. This issue, involving buffered and direct IO writes with snapshotting, is often triggered by fstest btrfs/078, and got reported by fsck when not using the NO_HOLES features, for example: $ cat results/btrfs/078.full (...) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent *** fsck.btrfs output *** [1/7] checking root items [2/7] checking extents [3/7] checking free space cache [4/7] checking fs roots root 258 inode 264 errors 100, file extent discount Found file extent holes: start: 524288, len: 65536 ERROR: errors found in fs roots Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: open code now trivial btrfs_set_lock_blockingDavid Sterba
btrfs_set_lock_blocking is now only a simple wrapper around btrfs_set_lock_blocking_write. The name does not bring any semantic value that could not be inferred from the new function so there's no point keeping it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Introduce per-root swapped blocks infrastructureQu Wenruo
To allow delayed subtree swap rescan, btrfs needs to record per-root information about which tree blocks get swapped. This patch introduces the required infrastructure. The designed workflow will be: 1) Record the subtree root block that gets swapped. During subtree swap: O = Old tree blocks N = New tree blocks reloc tree subvolume tree X Root Root / \ / \ NA OB OA OB / | | \ / | | \ NC ND OE OF OC OD OE OF In this case, NA and OA are going to be swapped, record (NA, OA) into subvolume tree X. 2) After subtree swap. reloc tree subvolume tree X Root Root / \ / \ OA OB NA OB / | | \ / | | \ OC OD OE OF NC ND OE OF 3a) COW happens for OB If we are going to COW tree block OB, we check OB's bytenr against tree X's swapped_blocks structure. If it doesn't fit any, nothing will happen. 3b) COW happens for NA Check NA's bytenr against tree X's swapped_blocks, and get a hit. Then we do subtree scan on both subtrees OA and NA. Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND). Then no matter what we do to subvolume tree X, qgroup numbers will still be correct. Then NA's record gets removed from X's swapped_blocks. 4) Transaction commit Any record in X's swapped_blocks gets removed, since there is no modification to swapped subtrees, no need to trigger heavy qgroup subtree rescan for them. This will introduce 128 bytes overhead for each btrfs_root even qgroup is not enabled. This is to reduce memory allocations and potential failures. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: call btrfs_create_pending_block_groups unconditionallyJosef Bacik
The first thing we do is loop through the list, this if (!list_empty()) btrfs_create_pending_block_groups(); thing is just wasted space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-30btrfs: clean up pending block groups when transaction commit abortsDavid Sterba
The fstests generic/475 stresses transaction aborts and can reveal space accounting or use-after-free bugs regarding block goups. In this case the pending block groups that remain linked to the structures after transaction commit aborts in the middle. The corrupted slabs lead to failures in following tests, eg. generic/476 [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 [ 8172.755799] #PF error: [normal kernel read fault] [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0 [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G W 5.0.0-rc2-default #408 [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90 [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287 [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000 [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0 [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000 [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78 [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200 [ 8172.778886] FS: 0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000 [ 8172.780545] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0 [ 8172.783547] Call Trace: [ 8172.784112] shrink_inactive_list+0x194/0x410 [ 8172.784747] shrink_node_memcg.constprop.85+0x3a5/0x6a0 [ 8172.785472] shrink_node+0x62/0x1e0 [ 8172.786011] balance_pgdat+0x216/0x460 [ 8172.786577] kswapd+0xe3/0x4a0 [ 8172.787085] ? finish_wait+0x80/0x80 [ 8172.787795] ? balance_pgdat+0x460/0x460 [ 8172.788799] kthread+0x116/0x130 [ 8172.789640] ? kthread_create_on_node+0x60/0x60 [ 8172.790323] ret_from_fork+0x24/0x30 [ 8172.794253] CR2: 0000000000000058 or accounting errors at umount time: [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs] [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G W 5.0.0-rc2-default #408 [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs] [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206 [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000 [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800 [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000 [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108 [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100 [ 8159.562852] FS: 00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000 [ 8159.564839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0 [ 8159.567898] Call Trace: [ 8159.568597] close_ctree+0x17f/0x350 [btrfs] [ 8159.569628] generic_shutdown_super+0x64/0x100 [ 8159.570808] kill_anon_super+0x14/0x30 [ 8159.571857] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8159.573063] deactivate_locked_super+0x29/0x60 [ 8159.574234] cleanup_mnt+0x3b/0x70 [ 8159.575176] task_work_run+0x98/0xc0 [ 8159.576177] exit_to_usermode_loop+0x83/0x90 [ 8159.577315] do_syscall_64+0x15b/0x180 [ 8159.578339] entry_SYSCALL_64_after_hwframe+0x49/0xbe This fix is based on 2 Josef's patches that used sideefects of btrfs_create_pending_block_groups, this fix introduces the helper that does what we need. CC: stable@vger.kernel.org # 4.4+ CC: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-28btrfs: don't end the transaction for delayed refs in throttleJosef Bacik
Previously callers to btrfs_end_transaction_throttle() would commit the transaction if there wasn't enough delayed refs space. This happens in relocation, and if the fs is relatively empty we'll run out of delayed refs space basically immediately, so we'll just be stuck in this loop of committing the transaction over and over again. This code existed because we didn't have a good feedback mechanism for running delayed refs, but with the delayed refs rsv we do now. Delete this throttling code and let the btrfs_start_transaction() in relocation deal with putting pressure on the delayed refs infrastructure. With this patch we no longer take 5 minutes to balance a metadata only fs. Qu has submitted a fstest to catch slow balance or excessive transaction commits. Steps to reproduce: * create subvolume * create many (eg. 16000) inlined files, of size 2KiB * iteratively snapshot and touch several files to trigger metadata updates * start balance -m Reported-by: Qu Wenruo <wqu@suse.com> Fixes: 64403612b73a ("btrfs: rework btrfs_check_space_for_delayed_refs") Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add tags and steps to reproduce ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: don't run delayed refs in the end transaction logicJosef Bacik
Over the years we have built up a lot of infrastructure to keep delayed refs in check, mostly by running them at btrfs_end_transaction() time. We have a lot of different maths we do to figure out how much, if we should do it inline or async, etc. This existed because we had no feedback mechanism to force the flushing of delayed refs when they became a problem. However with the enospc flushing infrastructure in place for flushing delayed refs when they put too much pressure on the enospc system we have this problem solved. Rip out all of this code as it is no longer needed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: rework btrfs_check_space_for_delayed_refsJosef Bacik
Now with the delayed_refs_rsv we can now know exactly how much pending delayed refs space we need. This means we can drastically simplify btrfs_check_space_for_delayed_refs by simply checking how much space we have reserved for the global rsv (which acts as a spill over buffer) and the delayed refs rsv. If our total size is beyond that amount then we know it's time to commit the transaction and stop any more delayed refs from being generated. With the introduction of dealyed_refs_rsv infrastructure, namely btrfs_update_delayed_refs_rsv we now know exactly how much pending delayed refs space is required. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: introduce delayed_refs_rsvJosef Bacik
Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. Historical background: The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictible. Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more we can use the global reserve for that spill over, which fits better into it's desired use than a full blown reservation. This first approach is to simply take how many times we're reserving space for and multiply that by 2 in order to save enough space for the delayed refs that could be generated. This is a niave approach and will probably evolve, but for now it works. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> # high-level review [ added background notes from the cover letter ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: cleanup the useless DEFINE_WAIT in cleanup_transactionLu Fengqi
When it was introduced in commit f094ac32aba3 ("Btrfs: fix NULL pointer after aborting a transaction"), it was not used. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: remove no longer used stuff for tracking pending ordered extentsFilipe Manana
Tracking pending ordered extents per transaction was introduced in commit 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change how we wait for pending ordered extents"). However now that on fsync we always wait for ordered extents to complete before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged extents infrastructure"), we no longer need the stuff to track for pending ordered extents, which was not completely removed in the mentioned commit. So remove the remaining of the pending ordered extents infrastructure. Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19btrfs: don't run delayed_iputs in commitJosef Bacik
This could result in a really bad case where we do something like evict evict_refill_and_join btrfs_commit_transaction btrfs_run_delayed_iputs evict evict_refill_and_join btrfs_commit_transaction ... forever We have plenty of other places where we run delayed iputs that are much safer, let those do the work. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: remove fs_info from btrfs_should_throttle_delayed_refsLu Fengqi
The avg_delayed_ref_runtime can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: remove fs_info from btrfs_check_space_for_delayed_refsLu Fengqi
It can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: release metadata before running delayed refsJosef Bacik
We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: open code btrfs_after_dev_replace_commitDavid Sterba
Too trivial, the purpose can be simply documented in a comment. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: delayed-refs: use rb_first_cached for href_rootLiu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). Functions manipulating href_root need to get the first entry, this converts href_root to use rb_first_cached(). This patch is first in the sequenct of similar updates to other rbtrees and this is analysis of the expected behaviour and improvements. There's a common pattern: while (node = rb_first) { entry = rb_entry(node) next = rb_next(node) rb_erase(node) cleanup(entry) } rb_first needs to traverse the tree up to logN depth, rb_erase can completely reshuffle the tree. With the caching we'll skip the traversal in rb_first. That's a cached memory access vs looped pointer dereference trade-off that IMHO has a clear winner. Measurements show there's not much difference in a sample tree with 10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of caching and pointer chasing are unpredictable though. Further optimzations can be done to avoid the expensive rb_erase step. In some cases it's ok to process the nodes in any order, so the tree can be traversed in post-order, not rebalancing the children nodes and just calling free. Care must be taken regarding the next node. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog from mail discussions ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: Remove 'objectid' member from struct btrfs_rootMisono Tomohiro
There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: Remove root parameter from btrfs_insert_dir_itemLu Fengqi
All callers pass the root tree of dir, we can push that down to the function itself. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: simplify IS_ERR/PTR_ERR checksAl Viro
IS_ERR(p) && PTR_ERR(p) == n is a weird way to spell p == ERR_PTR(n). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Remove fs_info from btrfs_add_root_refLu Fengqi
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inheritLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroupsLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: use monotonic time for transaction handlingArnd Bergmann
The transaction times were changed to ktime_get_real_seconds to avoid the y2038 overflow, but they still have a minor problem when they go backwards or jump due to settimeofday() or leap seconds. This changes the transaction handling to instead use ktime_get_seconds(), which returns a CLOCK_MONOTONIC timestamp that has neither of those problems. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>