summaryrefslogtreecommitdiff
path: root/fs/btrfs/qgroup.c
AgeCommit message (Collapse)Author
2019-02-27btrfs: move ulist allocation out of transaction in quota enableDavid Sterba
The allocation happens with GFP_KERNEL after a transaction has been started, this can potentially cause deadlock if reclaim tries to get the memory by flushing filesystem data. The fs_info::qgroup_ulist is not used during transaction start when quotas are not enabled. The status bit BTRFS_FS_QUOTA_ENABLED is set later in btrfs_quota_enable so it's safe to move it before the transaction start. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Make qgroup async transaction commit more aggressiveQu Wenruo
[BUG] Btrfs qgroup will still hit EDQUOT under the following case: $ dev=/dev/test/test $ mnt=/mnt/btrfs $ umount $mnt &> /dev/null $ umount $dev &> /dev/null $ mkfs.btrfs -f $dev $ mount $dev $mnt -o nospace_cache $ btrfs subv create $mnt/subv $ btrfs quota enable $mnt $ btrfs quota rescan -w $mnt $ btrfs qgroup limit -e 1G $mnt/subv $ fallocate -l 900M $mnt/subv/padding $ sync $ rm $mnt/subv/padding # Hit EDQUOT $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file [CAUSE] Since commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT"), btrfs is not forced to commit transaction to reclaim more quota space. Instead, we just check pertrans metadata reservation against some threshold and try to do asynchronously transaction commit. However in above case, the pertrans metadata reservation is pretty small thus it will never trigger asynchronous transaction commit. [FIX] Instead of only accounting pertrans metadata reservation, we calculate how much free space we have, and if there isn't much free space left, commit transaction asynchronously to try to free some space. This may slow down the fs when we have less than 32M free qgroup space, but should reduce a lot of false EDQUOT, so the cost should be acceptable. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to ↵Qu Wenruo
btrfs_qgroup_extent_record [BUG] Btrfs/139 will fail with a high probability if the testing machine (VM) has only 2G RAM. Resulting the final write success while it should fail due to EDQUOT, and the fs will have quota exceeding the limit by 16K. The simplified reproducer will be: (needs a 2G ram VM) $ mkfs.btrfs -f $dev $ mount $dev $mnt $ btrfs subv create $mnt/subv $ btrfs quota enable $mnt $ btrfs quota rescan -w $mnt $ btrfs qgroup limit -e 1G $mnt/subv $ for i in $(seq -w 1 8); do xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null echo "file $i written" > /dev/kmsg done $ sync $ btrfs qgroup show -pcre --raw $mnt The last pwrite will not trigger EDQUOT and final 'qgroup show' will show something like: qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16384 16384 none none --- --- 0/256 1073758208 1073758208 none 1073741824 --- --- And 1073758208 is larger than > 1073741824. [CAUSE] It's a bug in btrfs qgroup data reserved space management. For quota limit, we must ensure that: reserved (data + metadata) + rfer/excl <= limit Since rfer/excl is only updated at transaction commmit time, reserved space needs to be taken special care. One important part of reserved space is data, and for a new data extent written to disk, we still need to take the reserved space until rfer/excl numbers get updated. Originally when an ordered extent finishes, we migrate the reserved qgroup data space from extent_io tree to delayed ref head of the data extent, expecting delayed ref will only be cleaned up at commit transaction time. However for small RAM machine, due to memory pressure dirty pages can be flushed back to disk without committing a transaction. The related events will be something like: file 1 written btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840 btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096 btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344 btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192 btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344 cleanup_ref_head: num_bytes=54947840 cleanup_ref_head: num_bytes=5636096 cleanup_ref_head: num_bytes=569344 cleanup_ref_head: num_bytes=57344 cleanup_ref_head: num_bytes=8192 ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space file 2 written ... file 8 written cleanup_ref_head: num_bytes=8192 ... btrfs_commit_transaction <<< the only transaction committed during the test When file 2 is written, we have already freed 128M reserved qgroup data space for ino 258. Thus later write won't trigger EDQUOT. This allows us to write more data beyond qgroup limit. In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT. [FIX] By moving reserved qgroup data space from btrfs_delayed_ref_head to btrfs_qgroup_extent_record, we can ensure that reserved qgroup data space won't be freed half way before commit transaction, thus fix the problem. Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Remove duplicated trace points for qgroup_rsv_add/releaseQu Wenruo
Inside qgroup_rsv_add/release(), we have trace events trace_qgroup_update_reserve() to catch reserved space update. However we still have two manual trace_qgroup_update_reserve() calls just outside these functions. Remove these duplicated calls. Fixes: 64ee4e751a1c ("btrfs: qgroup: Update trace events to use new separate rsv types") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpersDavid Sterba
We can use the right helper where the lock type is a fixed parameter. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Cleanup old subtree swap codeQu Wenruo
Since it's replaced by new delayed subtree swap code, remove the original code. The cleanup is small since most of its core function is still used by delayed subtree swap trace. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Use delayed subtree rescan for balanceQu Wenruo
Before this patch, qgroup code traces the whole subtree of subvolume and reloc trees unconditionally. This makes qgroup numbers consistent, but it could cause tons of unnecessary extent tracing, which causes a lot of overhead. However for subtree swap of balance, just swap both subtrees because they contain the same contents and tree structure, so qgroup numbers won't change. It's the race window between subtree swap and transaction commit could cause qgroup number change. This patch will delay the qgroup subtree scan until COW happens for the subtree root. So if there is no other operations for the fs, balance won't cause extra qgroup overhead. (best case scenario) Depending on the workload, most of the subtree scan can still be avoided. Only for worst case scenario, it will fall back to old subtree swap overhead. (scan all swapped subtrees) [[Benchmark]] Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. And after file system population, there is no other activity, so it should be the best case scenario. | v4.20-rc1 | w/ patchset | diff ----------------------------------------------------------------------- relocated extents | 22615 | 22457 | -0.1% qgroup dirty extents | 163457 | 121606 | -25.6% time (sys) | 22.884s | 18.842s | -17.6% time (real) | 27.724s | 22.884s | -17.5% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Introduce per-root swapped blocks infrastructureQu Wenruo
To allow delayed subtree swap rescan, btrfs needs to record per-root information about which tree blocks get swapped. This patch introduces the required infrastructure. The designed workflow will be: 1) Record the subtree root block that gets swapped. During subtree swap: O = Old tree blocks N = New tree blocks reloc tree subvolume tree X Root Root / \ / \ NA OB OA OB / | | \ / | | \ NC ND OE OF OC OD OE OF In this case, NA and OA are going to be swapped, record (NA, OA) into subvolume tree X. 2) After subtree swap. reloc tree subvolume tree X Root Root / \ / \ OA OB NA OB / | | \ / | | \ OC OD OE OF NC ND OE OF 3a) COW happens for OB If we are going to COW tree block OB, we check OB's bytenr against tree X's swapped_blocks structure. If it doesn't fit any, nothing will happen. 3b) COW happens for NA Check NA's bytenr against tree X's swapped_blocks, and get a hit. Then we do subtree scan on both subtrees OA and NA. Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND). Then no matter what we do to subvolume tree X, qgroup numbers will still be correct. Then NA's record gets removed from X's swapped_blocks. 4) Transaction commit Any record in X's swapped_blocks gets removed, since there is no modification to swapped subtrees, no need to trigger heavy qgroup subtree rescan for them. This will introduce 128 bytes overhead for each btrfs_root even qgroup is not enabled. This is to reduce memory allocations and potential failures. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swapQu Wenruo
Refactor btrfs_qgroup_trace_subtree_swap() into qgroup_trace_subtree_swap(), which only needs two extent buffer and some other bool to control the behavior. This provides the basis for later delayed subtree scan work. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creationFilipe Manana
If the quota enable and snapshot creation ioctls are called concurrently we can get into a deadlock where the task enabling quotas will deadlock on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it twice, or the task creating a snapshot tries to commit the transaction while the task enabling quota waits for the former task to commit the transaction while holding the mutex. The following time diagrams show how both cases happen. First scenario: CPU 0 CPU 1 btrfs_ioctl() btrfs_ioctl_quota_ctl() btrfs_quota_enable() mutex_lock(fs_info->qgroup_ioctl_lock) btrfs_start_transaction() btrfs_ioctl() btrfs_ioctl_snap_create_v2 create_snapshot() --> adds snapshot to the list pending_snapshots of the current transaction btrfs_commit_transaction() create_pending_snapshots() create_pending_snapshot() qgroup_account_snapshot() btrfs_qgroup_inherit() mutex_lock(fs_info->qgroup_ioctl_lock) --> deadlock, mutex already locked by this task at btrfs_quota_enable() Second scenario: CPU 0 CPU 1 btrfs_ioctl() btrfs_ioctl_quota_ctl() btrfs_quota_enable() mutex_lock(fs_info->qgroup_ioctl_lock) btrfs_start_transaction() btrfs_ioctl() btrfs_ioctl_snap_create_v2 create_snapshot() --> adds snapshot to the list pending_snapshots of the current transaction btrfs_commit_transaction() --> waits for task at CPU 0 to release its transaction handle btrfs_commit_transaction() --> sees another task started the transaction commit first --> releases its transaction handle --> waits for the transaction commit to be completed by the task at CPU 1 create_pending_snapshot() qgroup_account_snapshot() btrfs_qgroup_inherit() mutex_lock(fs_info->qgroup_ioctl_lock) --> deadlock, task at CPU 0 has the mutex locked but it is waiting for us to finish the transaction commit So fix this by setting the quota enabled flag in fs_info after committing the transaction at btrfs_quota_enable(). This ends up serializing quota enable and snapshot creation as if the snapshot creation happened just before the quota enable request. The quota rescan task, scheduled after committing the transaction in btrfs_quote_enable(), will do the accounting. Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove unnecessary tree locking code in qgroup_rescan_leafNikolay Borisov
In qgroup_rescan_leaf a copy is made of the target leaf by calling btrfs_clone_extent_buffer. The latter allocates a new buffer and attaches a new set of pages and copies the content of the source buffer. The new scratch buffer is only used to iterate it's items, it's not published anywhere and cannot be accessed by a third party. Hence, it's not necessary to perform any locking on it whatsoever. Furthermore, remove the extra extent_buffer_get call since the new buffer is always allocated with a reference count of 1 which is sufficient here. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-22Btrfs: fix race between enabling quotas and subvolume creationFilipe Manana
We have a race between enabling quotas end subvolume creation that cause subvolume creation to fail with -EINVAL, and the following diagram shows how it happens: CPU 0 CPU 1 btrfs_ioctl() btrfs_ioctl_quota_ctl() btrfs_quota_enable() mutex_lock(fs_info->qgroup_ioctl_lock) btrfs_ioctl() create_subvol() btrfs_qgroup_inherit() -> save fs_info->quota_root into quota_root -> stores a NULL value -> tries to lock the mutex qgroup_ioctl_lock -> blocks waiting for the task at CPU0 -> sets BTRFS_FS_QUOTA_ENABLED in fs_info -> sets quota_root in fs_info->quota_root (non-NULL value) mutex_unlock(fs_info->qgroup_ioctl_lock) -> checks quota enabled flag is set -> returns -EINVAL because fs_info->quota_root was NULL before it acquired the mutex qgroup_ioctl_lock -> ioctl returns -EINVAL Returning -EINVAL to user space will be confusing if all the arguments passed to the subvolume creation ioctl were valid. Fix it by grabbing the value from fs_info->quota_root after acquiring the mutex. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branchLu Fengqi
There is no reason to put this check in (!qgroup)'s else branch because if qgroup is null, it will goto out directly. So move it out to reduce indentation level. No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabledQu Wenruo
Some qgroup trace events like btrfs_qgroup_release_data() and btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is not enabled. This is caused by the lack of qgroup status check before calling some qgroup functions. Thankfully the functions can handle quota disabled case well and just do nothing for qgroup disabled case. This patch will do earlier check before triggering related trace events. And for enabled <-> disabled race case: 1) For enabled->disabled case Disable will wipe out all qgroups data including reservation and excl/rfer. Even if we leak some reservation or numbers, it will still be cleared, so nothing will go wrong. 2) For disabled -> enabled case Current btrfs_qgroup_release_data() will use extent_io tree to ensure we won't underflow reservation. And for delayed_ref we use head->qgroup_reserved to record the reserved space, so in that case head->qgroup_reserved should be 0 and we won't underflow. CC: stable@vger.kernel.org # 4.14+ Reported-by: Chris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Only trace data extents in leaves if we're relocating data ↵Qu Wenruo
block group For qgroup_trace_extent_swap(), if we find one leaf that needs to be traced, we will also iterate all file extents and trace them. This is OK if we're relocating data block groups, but if we're relocating metadata block groups, balance code itself has ensured that both subtree of file tree and reloc tree contain the same contents. That's to say, if we're relocating metadata block groups, all file extents in reloc and file tree should match, thus no need to trace them. This should reduce the total number of dirty extents processed in metadata block group balance. [[Benchmark]] (with all previous enhancement) Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. | v4.19-rc1 | w/ patchset | diff (*) --------------------------------------------------------------- relocated extents | 22929 | 22851 | -0.3% qgroup dirty extents | 227757 | 140886 | -38.1% time (sys) | 65.253s | 37.464s | -42.6% time (real) | 74.032s | 44.722s | -39.6% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Use generation-aware subtree swap to mark dirty extentsQu Wenruo
Before this patch, with quota enabled during balance, we need to mark the whole subtree dirty for quota. E.g. OO = Old tree blocks (from file tree) NN = New tree blocks (from reloc tree) File tree (src) Reloc tree (dst) OO (a) NN (a) / \ / \ (b) OO OO (c) (b) NN NN (c) / \ / \ / \ / \ OO OO OO OO (d) OO OO OO NN (d) For old balance + quota case, quota will mark the whole src and dst tree dirty, including all the 3 old tree blocks in reloc tree. It's doable for small file tree or new tree blocks are all located at lower level. But for large file tree or new tree blocks are all located at higher level, this will lead to mark the whole tree dirty, and be unbelievably slow. This patch will change how we handle such balance with quota enabled case. Now we will search from (b) and (c) for any new tree blocks whose generation is equal to @last_snapshot, and only mark them dirty. In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d). (NN(a) will be traced when COW happens for nodeptr modification). And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW happens for nodeptr modification.) For above case, we could skip 3 tree blocks, but for larger tree, we can skip tons of unmodified tree blocks, and hugely speed up balance. This patch will introduce a new function, btrfs_qgroup_trace_subtree_swap(), which will do the following main work: 1) Read out real root eb And setup basic dst_path for later calls 2) Call qgroup_trace_new_subtree_blocks() To trace all new tree blocks in reloc tree and their counter parts in the file tree. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Introduce function to find all new tree blocks of reloc treeQu Wenruo
Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate all new tree blocks in a reloc tree. So that qgroup could skip unrelated tree blocks during balance, which should hugely speedup balance speed when quota is enabled. The function qgroup_trace_new_subtree_blocks() itself only cares about new tree blocks in reloc tree. All its main works are: 1) Read out tree blocks according to parent pointers 2) Do recursive depth-first search Will call the same function on all its children tree blocks, with search level set to current level -1. And will also skip all children whose generation is smaller than @last_snapshot. 3) Call qgroup_trace_extent_swap() to trace tree blocks So although we have parameter list related to source file tree, it's not used at all, but only passed to qgroup_trace_extent_swap(). Thus despite the tree read code, the core should be pretty short and all about recursive depth-first search. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Introduce function to trace two swaped extentsQu Wenruo
Introduce a new function, qgroup_trace_extent_swap(), which will be used later for balance qgroup speedup. The basis idea of balance is swapping tree blocks between reloc tree and the real file tree. The swap will happen in highest tree block, but there may be a lot of tree blocks involved. For example: OO = Old tree blocks NN = New tree blocks allocated during balance File tree (257) Reloc tree for 257 L2 OO NN / \ / \ L1 OO OO (a) OO NN (a) / \ / \ / \ / \ L0 OO OO OO OO OO OO NN NN (b) (c) (b) (c) When calling qgroup_trace_extent_swap(), we will pass: @src_eb = OO(a) @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ] @dst_level = 0 @root_level = 1 In that case, qgroup_trace_extent_swap() will search from OO(a) to reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty. The main work of qgroup_trace_extent_swap() can be split into 3 parts: 1) Tree search from @src_eb It should acts as a simplified btrfs_search_slot(). The key for search can be extracted from @dst_path->nodes[dst_level] (first key). 2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty. They should be marked during preivous (@dst_level = 1) iteration. 3) Mark file extents in leaves dirty We don't have good way to pick out new file extents only. So we still follow the old method by scanning all file extents in the leave. This function can free us from keeping two pathes, thus later we only need to care about how to iterate all new tree blocks in reloc tree. Signed-off-by: Qu Wenruo <wqu@suse.com> [ copy changelog to function comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Introduce trace event to analyse the number of dirty extents ↵Qu Wenruo
accounted Number of qgroup dirty extents is directly linked to the performance overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to record how many dirty extents is processed in btrfs_qgroup_account_extents(). This will be pretty handy to analyze later balance performance improvement. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Dirty all qgroups before rescanQu Wenruo
[BUG] In the following case, rescan won't zero out the number of qgroup 1/0: $ mkfs.btrfs -fq $DEV $ mount $DEV /mnt $ btrfs quota enable /mnt $ btrfs qgroup create 1/0 /mnt $ btrfs sub create /mnt/sub $ btrfs qgroup assign 0/257 1/0 /mnt $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000 $ btrfs sub snap /mnt/sub /mnt/snap $ btrfs quota rescan -w /mnt $ btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16.00KiB 16.00KiB none none --- --- 0/257 1016.00KiB 16.00KiB none none 1/0 --- 0/258 1016.00KiB 16.00KiB none none --- --- 1/0 1016.00KiB 16.00KiB none none --- 0/257 So far so good, but: $ btrfs qgroup remove 0/257 1/0 /mnt WARNING: quotas may be inconsistent, rescan needed $ btrfs quota rescan -w /mnt $ btrfs qgroup show -pcre /mnt qgoupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16.00KiB 16.00KiB none none --- --- 0/257 1016.00KiB 16.00KiB none none --- --- 0/258 1016.00KiB 16.00KiB none none --- --- 1/0 1016.00KiB 16.00KiB none none --- --- ^^^^^^^^^^ ^^^^^^^^ not cleared [CAUSE] Before rescan we call qgroup_rescan_zero_tracking() to zero out all qgroups' accounting numbers. However we don't mark all qgroups dirty, but rely on rescan to do so. If we have any high level qgroup without children, it won't be marked dirty during rescan, since we cannot reach that qgroup. This will cause QGROUP_INFO items of childless qgroups never get updated in the quota tree, thus their numbers will stay the same in "btrfs qgroup show" output. [FIX] Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if we have childless qgroups, their QGROUP_INFO items will still get updated during rescan. Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: Remove 'objectid' member from struct btrfs_rootMisono Tomohiro
There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23btrfs: use after free in btrfs_quota_enableDan Carpenter
The issue here is that btrfs_commit_transaction() frees "trans" on both the error and the success path. So the problem would be if btrfs_commit_transaction() succeeds, and then qgroup_rescan_init() fails. That means that "ret" is non-zero and "trans" is non-NULL and it leads to a use after free inside the btrfs_end_transaction() macro. Fixes: 340f1aa27f36 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Remove fs_info from btrfs_del_rootLu Fengqi
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from qgroup_rescan_leafLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inheritLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroupsLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extentLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtreeLu Fengqi
The fs_info can be fetched from the transaction handle directly. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_itemsLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extentLu Fengqi
It can be fetched from the transaction handle. In addition, remove the WARN_ON(trans == NULL) because it's not possible to hit this condition. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroupLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroupLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroupLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relationLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from __del_qgroup_relationLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relationLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop quota_root and fs_info parameters from ↵Lu Fengqi
update_qgroup_status_item They can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop root parameter from update_qgroup_info_itemLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop root parameter from update_qgroup_limit_itemLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop quota_root parameter from del_qgroup_itemLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop quota_root parameter from del_qgroup_relation_itemLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop quota_root parameter from add_qgroup_relation_itemLu Fengqi
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inheritLu Fengqi
Since commit 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables"), the srcroot is no longer used to get fs_info::nodesize. In fact, it can be dropped after commit 707e8a071528 ("btrfs: use nodesize everywhere, kill leafsize"). Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disableNikolay Borisov
Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable") not only resulted in an easier to follow code but it also introduced a subtle bug. It changed the timing when the initial transaction rescan was happening: - before the commit: it would happen after transaction commit had occured - after the commit: it might happen before the transaction was committed This results in failure to correctly rescan the quota since there could be data which is still not committed on disk. This patch aims to fix this by moving the transaction creation/commit inside btrfs_quota_enable, which allows to schedule the quota commit after the transaction has been committed. Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable") Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Link: https://marc.info/?l=linux-btrfs&m=152999289017582 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28Btrfs: fix mount failure when qgroup rescan is in progressFilipe Manana
If a power failure happens while the qgroup rescan kthread is running, the next mount operation will always fail. This is because of a recent regression that makes qgroup_rescan_init() incorrectly return -EINVAL when we are mounting the filesystem (through btrfs_read_qgroup_config()). This causes the -EINVAL error to be returned regardless of any qgroup flags being set instead of returning the error only when neither of the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON are set. A test case for fstests follows up soon. Fixes: 9593bf49675e ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28btrfs: quota: Set rescan progress to (u64)-1 if we hit last leafQu Wenruo
Commit ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree") added a new exit for rescan finish. However after finishing quota rescan, we set fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the original exit path. While we missed that assignment of (u64)-1 in the new exit path. The end result is, the quota status item doesn't have the same value. (-1 vs the last bytenr + 1) Although it doesn't affect quota accounting, it's still better to keep the original behavior. Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Fixes: ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30btrfs: qgroup: show more meaningful qgroup_rescan_init error messageQu Wenruo
Error message from qgroup_rescan_init() mostly looks like: BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115 Which is far from meaningful, and sometimes confusing as for above -EINPROGRESS it's mostly (despite the init race) harmless, but sometimes it can also indicate problem if the return value is -EINVAL. Change it to some more meaningful messages like: BTRFS info (device nvme0n1p1): qgroup rescan is already in progress And BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update the messages and level ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28btrfs: qgroup: Finish rescan when hit the last leaf of extent treeQu Wenruo
Under the following case, qgroup rescan can double account cowed tree blocks: In this case, extent tree only has one tree block. - | transid=5 last committed=4 | btrfs_qgroup_rescan_worker() | |- btrfs_start_transaction() | | transid = 5 | |- qgroup_rescan_leaf() | |- btrfs_search_slot_for_read() on extent tree | Get the only extent tree block from commit root (transid = 4). | Scan it, set qgroup_rescan_progress to the last | EXTENT/META_ITEM + 1 | now qgroup_rescan_progress = A + 1. | | fs tree get CoWed, new tree block is at A + 16K | transid 5 get committed - | transid=6 last committed=5 | btrfs_qgroup_rescan_worker() | btrfs_qgroup_rescan_worker() | |- btrfs_start_transaction() | | transid = 5 | |- qgroup_rescan_leaf() | |- btrfs_search_slot_for_read() on extent tree | Get the only extent tree block from commit root (transid = 5). | scan it using qgroup_rescan_progress (A + 1). | found new tree block beyong A, and it's fs tree block, | account it to increase qgroup numbers. - In above case, tree block A, and tree block A + 16K get accounted twice, while qgroup rescan should stop when it already reach the last leaf, other than continue using its qgroup_rescan_progress. Such case could happen by just looping btrfs/017 and with some possibility it can hit such double qgroup accounting problem. Fix it by checking the path to determine if we should finish qgroup rescan, other than relying on next loop to exit. Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28btrfs: qgroup: Search commit root for rescan to avoid missing extentQu Wenruo
When doing qgroup rescan using the following script (modified from btrfs/017 test case), we can sometimes hit qgroup corruption. ------ umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f -n 64k $dev mount $dev $mnt extent_size=8192 xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null btrfs subvolume snapshot $mnt $mnt/snap xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/unll btrfs quota enable $mnt # -W is the new option to only wait rescan while not starting new one btrfs quota rescan -W $mnt btrfs qgroup show -prce $mnt umount $mnt # Need to patch btrfs-progs to report qgroup mismatch as error btrfs check $dev || _fail ------ For fast machine, we can hit some corruption which missed accounting tree blocks: ------ qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 8.00KiB 0.00B none none --- --- 0/257 8.00KiB 0.00B none none --- --- ------ This is due to the fact that we're always searching commit root for btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is from current transaction, not commit root. And if our tree blocks get modified in current transaction, we won't find any owner in commit root, thus causing the corruption. Fix it by searching commit root for extent tree for qgroup_rescan_leaf(). Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>