summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-01-07btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zonedJohannes Thumshirn
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the bio's bio_op. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: zoned: sink zone check into btrfs_repair_one_zoneJohannes Thumshirn
Sink zone check into btrfs_repair_one_zone() so we don't need to do it in all callers. Also as btrfs_repair_one_zone() doesn't return a sensible error, make it a boolean function and return false in case it got called on a non-zoned filesystem and true on a zoned filesystem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: zoned: simplify btrfs_check_meta_write_pointerJohannes Thumshirn
btrfs_check_meta_write_pointer() will always be called with a NULL 'cache_ret' argument. As there's no need to check if we have a valid block_group passed in remove these checks. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: zoned: encapsulate inode locking for zoned relocationJohannes Thumshirn
Encapsulate the inode lock needed for serializing the data relocation writes on a zoned filesystem into a helper. This streamlines the code reading flow and hides special casing for zoned filesystems. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the deviceAnand Jain
In the case of the seed device, the fsid can be different from the mounted sprout fsid. The userland has to read the device superblock to know the fsid but, that idea fails if the device is missing. So add a sysfs interface devinfo/<devid>/fsid to show the fsid of the device. For example: $ cd /sys/fs/btrfs/b10b02a5-f9de-4276-b9e8-2bfd09a578a8 $ cat devinfo/1/fsid c44d771f-639d-4df3-99ec-5bc7ad2af93b $ cat devinfo/3/fsid b10b02a5-f9de-4276-b9e8-2bfd09a578a8 Though it's related to seeding, the name of the sysfs file is plain fsid as it matches what blkid says. A path to the device's fsid will aid scripting. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: reserve extra space for the free space treeJosef Bacik
Filipe reported a problem where sometimes he'd get an ENOSPC abort when running delayed refs with generic/619 and the free space tree enabled. This is partly because we do not reserve space for modifying the free space tree, nor do we have a block rsv associated with that tree. The delayed_refs_rsv tracks the amount of space required to run delayed refs. This means 1 modification means 1 change to the extent root. With the free space tree this turns into 2 changes, because modifying 1 extent means updating the extent tree and potentially updating the free space tree to either remove that entry or add the free space. Thus if we have the FST enabled, simply double the reservation size for our modification. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: include the free space tree in the global rsv minimum calculationJosef Bacik
Filipe reported a problem where generic/619 was failing with an ENOSPC abort while running delayed refs, like the following BTRFS: Transaction aborted (error -28) WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs] CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G W 5.16.0-rc2-btrfs-next-106 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs] RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4 R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90 FS: 0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __btrfs_free_extent+0x516/0x950 [btrfs] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] btrfs_run_delayed_refs+0x86/0x210 [btrfs] flush_space+0x403/0x630 [btrfs] ? call_rcu_tasks_generic+0x50/0x80 ? lock_release+0x223/0x4a0 ? btrfs_get_alloc_profile+0xb5/0x290 [btrfs] ? do_raw_spin_unlock+0x4b/0xa0 btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs] process_one_work+0x24c/0x5b0 worker_thread+0x55/0x3c0 ? process_one_work+0x5b0/0x5b0 kthread+0x17c/0x1a0 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 There's a couple of reasons for this, but in generic/619's case the largest reason is because it is a very small file system, ad we do not reserve enough space for the global reserve. With the free space tree we now have the free space tree that we need to modify when running delayed refs. This means we need the global reserve to take this into account when it calculates the minimum size it needs to be. This is especially important for very small file systems. Fix this by adjusting the minimum global block rsv size math to include the size of the free space tree when calculating the size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: scrub: merge SCRUB_PAGES_PER_RD_BIO and SCRUB_PAGES_PER_WR_BIOQu Wenruo
These two values were introduced in commit ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk") as an optimization. But the truth is, block layer scheduler can do whatever it wants to merge/split bios to improve performance. Doing such "optimization" is not really going to affect much, especially considering how good current block layer optimizations are doing. Remove such old and immature optimization from our code. Since we're here, also change BUG_ON()s using these two macros to use ASSERT()s. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: update SCRUB_MAX_PAGES_PER_BLOCKQu Wenruo
Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to calculate this value. And remove one stale comment on the value, in fact with recent subpage support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond BTRFS_STRIPE_LEN, just we don't use the full page. Also since we're here, update the BUG_ON() related to SCRUB_MAX_PAGES_PER_BLOCK to ASSERT(). As those ASSERT() are really only for developers to catch early obvious bugs, not to let end users suffer. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: do not check -EAGAIN when truncating inodes in the log rootJosef Bacik
We only throttle the btrfs_truncate_inode_items if the root is SHAREABLE, which isn't set on the log root, which means this loop is unnecessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: make should_throttle loop local in btrfs_truncate_inode_itemsJosef Bacik
We reset this bool on every loop through the truncate loop, make this variable local to the loop. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: combine extra if statements in btrfs_truncate_inode_itemsJosef Bacik
We have if (del_item) // do something else // something else if (del_item) // do yet another thing else // something else entirely back to back in btrfs_truncate_inode_items, collapse these two sets of if statements into one. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: convert BUG() for pending_del_nr into an ASSERTJosef Bacik
This is a logic correctness check, convert it into an ASSERT() instead of a BUG(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: convert BUG_ON() in btrfs_truncate_inode_items to ASSERTJosef Bacik
We have a correctness BUG_ON() in btrfs_truncate_inode_items to make sure that we're always using min_type == BTRFS_EXTENT_DATA_KEY if new_size is > 0. Convert this to an ASSERT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: add inode to truncate controlJosef Bacik
In the future we're going to want to use btrfs_truncate_inode_items without looking up the associated inode. In order to accommodate this add the inode to btrfs_truncate_control and handle the case where control->inode is NULL appropriately. This is fairly straightforward, we simply need to add a helper for the trace points, as the file extent map update is controlled by a flag on btrfs_truncate_control. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: pass the ino via truncate controlJosef Bacik
In the future we are going to want to truncate inode items without needing to have an btrfs_inode to pass in, so add ino to the btrfs_truncate_control and use that to look up the inode items to truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: use a flag to control when to clear the file extent rangeJosef Bacik
We only care about updating the file extent range when we are doing a normal truncation. We skip this for tree logging currently, but we can also skip this for eviction as well. Using a flag makes it more explicit when we want to do this work. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: control extent reference updates with a control flag for truncateJosef Bacik
We've had weird bugs in the past where we forgot to adjust the truncate path to deal with the fact that we can be called by the tree log path. Instead of checking if our root is a LOG_ROOT use a flag on the btrfs_truncate_control to indicate that we don't want to do extent reference updates during this truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: only call inode_sub_bytes in truncate paths that careJosef Bacik
We currently have a bunch of awkward checks to make sure we only update the inode i_bytes if we're truncating the real inode. Instead keep track of the number of bytes we need to sub in the btrfs_truncate_control, and then do the appropriate adjustment in the truncate paths that care. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: only update i_size in truncate paths that careJosef Bacik
We currently will update the i_size of the inode as we truncate it down, however we skip this if we're calling btrfs_truncate_inode_items from the tree log code. However we also don't care about this in the case of evict. Instead keep track of this value in the btrfs_truncate_control and then have btrfs_truncate() and the free space cache truncate path both do the i_size update themselves. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: add truncate control structJosef Bacik
I'm going to be adding more arguments and counters to btrfs_truncate_inode_items, so add a control struct to handle all of the extra arguments to make it easier to follow. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove found_extent from btrfs_truncate_inode_itemsJosef Bacik
We only set this if we find a normal file extent, del_item == 1, and the file extent points to a real extent and isn't a hole extent. We can use del_item == 1 && extent_start != 0 to get the same information that found_extent provides, so remove this variable and use the other variables instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: move btrfs_kill_delayed_inode_items into evictJosef Bacik
We have a special case in btrfs_truncate_inode_items() to call btrfs_kill_delayed_inode_items() if min_type == 0, which is only called during evict. Instead move this out into evict proper, and add some comments because I erroneously attempted to remove this code altogether without understanding what we were doing. Evict is updating the inode only because we only care about making sure the i_nlink count has hit disk. If we had pending deletions we don't want to process those via the delayed inode updates, we simply want to drop all of them and reclaim the reserved metadata space. Then from there the btrfs_truncate_inode_items() will do the work to remove all of the items as appropriate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove free space cache inode check in btrfs_truncate_inode_itemsJosef Bacik
We no longer have inode cache feature, so this check is extraneous as the only inode cache is in the tree_root, which is not marked as SHAREABLE. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: move extent locking outside of btrfs_truncate_inode_itemsJosef Bacik
Currently we are locking the extent and dropping the extent cache for any inodes we truncate, unless they're in the tree log. We call this helper from: - truncate - evict - tree log - free space cache truncation For evict we've already dropped all of the extent cache for this inode once we've gotten here, and we're the only one accessing this inode, so this step is unnecessary. For the tree log code we already skip this part. Pull this work into the truncate path and the free space cache truncation path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: move btrfs_truncate_inode_items to inode-item.cJosef Bacik
This is an inode item related manipulation with a few vfs related adjustments. I'm going to remove the vfs related code from this helper and simplify it a lot, but I want those changes to be easily seen via git blame, so move this function now and then the simplification work can be done. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: add an inode-item.hJosef Bacik
We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove stale comment about locking at btrfs_search_slot()Filipe Manana
The comment refers to the old extent buffer locking code, where we used to have custom locks that had blocking and spinning behaviour modes. That is not the case anymore, since we have transitioned to rw semaphores, so the comment does not offer any value anymore. Remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove BUG_ON() after splitting leafFilipe Manana
After calling split_leaf() we BUG_ON() if the returned value is greater than zero. However split_leaf() only returns 0, in case of success, or a negative value in case of an error. The reason for the BUG_ON() is that if we ever get a positive return value from split_leaf(), we can not simply propagate it to the callers of btrfs_search_slot(), as that would be interpreted as "key not found" and not as an error. That means it could result in callers ending up causing some potential silent corruption. So change the BUG_ON() to an ASSERT(), and in case assertions are disabled, produce a warning and set the return value to an error, to make it not possible to get into a silent corruption and having the error not noticed. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: move leaf search logic out of btrfs_search_slot()Filipe Manana
There's quite a significant amount of code for doing the key search for a leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus btrfs_search_slot() is already big enough. So move the logic that does the key search on a leaf into a new helper function. This makes it better organized, removing the need for the labels and the gotos, as well as reducing the indentation level and the size of btrfs_search_slot(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove useless condition check before splitting leafFilipe Manana
When inserting a key, we check if the write_lock_level is less than 1, and if so we set it to 1, release the path and retry the tree traversal. However that is unnecessary, because when ins_len is greater than 0, we know that write_lock_level can never be less than 1. The logic to retry is also buggy, because in case ins_len was decremented, due to an exact key match and the search is not meant for item extension (path->search_for_extension is 0), we retry without incrementing ins_len, which would make the next retry decrement it again by the same amount. So remove the check for write_lock_level being less than 1 and add an assertion to assert it's always >= 1. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: try to unlock parent nodes earlier when inserting a keyFilipe Manana
When inserting a new key, we release the write lock on the leaf's parent only after doing the binary search on the leaf. This is because if the key ends up at slot 0, we will have to update the key at slot 0 of the parent node. The same reasoning applies to any other upper level nodes when their slot is 0. We also need to keep the parent locked in case the leaf does not have enough free space to insert the new key/item, because in that case we will split the leaf and we will need to add a new key to the parent due to a new leaf resulting from the split operation. However if the leaf has enough space for the new key and the key does not end up at slot 0 of the leaf we could release our write lock on the parent before doing the binary search on the leaf to figure out the destination slot. That leads to reducing the amount of time other tasks are blocked waiting to lock the parent, therefore increasing parallelism when there are other tasks that are trying to access other leaves accessible through the same parent. This also applies to other upper nodes besides the immediate parent, when their slot is 0, since we keep locks on them until we figure out if the leaf slot is slot 0 or not. In fact, having the key ending at up slot 0 when is rare. Typically it only happens when the key is less than or equals to the smallest, the "left most", key of the entire btree, during a split attempt when we try to push to the right sibling leaf or when the caller just wants to update the item of an existing key. It's also very common that a leaf has enough space to insert a new key, since after a split we move about half of the keys from one into the new leaf. So unlock the parent, and any other upper level nodes, when during a key insertion we notice the key is greater then the first key in the leaf and the leaf has enough free space. After unlocking the upper level nodes, do the binary search using a low boundary of slot 1 and not slot 0, to figure out the slot where the key will be inserted (or where the key already is in case it exists and the caller wants to modify its item data). This extra comparison, with the first key, is cheap and the key is very likely already in a cache line because it immediately follows the header of the extent buffer and we have recently read the level field of the header (which in fact is the last field of the header). The following fs_mark test was run on a non-debug kernel (debian's default kernel config), with a 12 cores intel CPU, and using a NVMe device: $ cat run-fsmark.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 165273.6 5958381 0 2400000 0 190938.3 6284477 0 3600000 0 181429.1 6044059 0 4800000 0 173979.2 6223418 0 6000000 0 139288.0 6384560 0 7200000 0 163000.4 6520083 1 8400000 0 57799.2 5388544 1 9600000 0 66461.6 5552969 2 10800000 0 49593.5 5163675 2 12000000 0 57672.1 4889398 After this change: FSUse% Count Size Files/sec App Overhead 0 1200000 0 167987.3 (+1.6%) 6272730 0 2400000 0 198563.9 (+4.0%) 6048847 0 3600000 0 197436.6 (+8.8%) 6163637 0 4800000 0 202880.7 (+16.6%) 6371771 1 6000000 0 167275.9 (+20.1%) 6556733 1 7200000 0 204051.2 (+25.2%) 6817091 1 8400000 0 69622.8 (+20.5%) 5525675 1 9600000 0 69384.5 (+4.4%) 5700723 1 10800000 0 61454.1 (+23.9%) 5363754 3 12000000 0 61908.7 (+7.3%) 5370196 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: allow generic_bin_search() to take low boundary as an argumentFilipe Manana
Right now generic_bin_search() always uses a low boundary slot of 0, but in the next patch we'll want to often skip slot 0 when searching for a key. So make generic_bin_search() have the low boundary slot specified as an argument, and move the check for the extent buffer level from btrfs_bin_search() to generic_bin_search() to avoid adding another wrapper around generic_bin_search(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: check the root node for uptodate before returning itJosef Bacik
Now that we clear the extent buffer uptodate if we fail to write it out we need to check to see if our root node is uptodate before we search down it. Otherwise we could return stale data (or potentially corrupt data that was caught by the write verification step) and think that the path is OK to search down. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: allow device add if balance is pausedNikolay Borisov
Currently paused balance precludes adding a device since they are both considered exclusive ops and we can have at most one running at a time. This is problematic in case a filesystem encounters an ENOSPC situation while balance is running, in this case the only thing the user can do is mount the fs with "skip_balance" which pauses balance and delete some data to free up space for balance. However, it should be possible to add a new device when balance is paused. Fix this by allowing device add to proceed when balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: make device add compatible with paused balance in ↵Nikolay Borisov
btrfs_exclop_start_try_lock This is needed to enable device add to work in cases when a file system has been mounted with 'skip_balance' mount option. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: introduce exclusive operation BALANCE_PAUSED stateNikolay Borisov
Current set of exclusive operation states is not sufficient to handle all practical use cases. In particular there is a need to be able to add a device to a filesystem that have paused balance. Currently there is no way to distinguish between a running and a paused balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2 occasions: 1. When a filesystem is mounted with skip_balance and there is an unfinished balance it will now be into BALANCE_PAUSED instead of simply BALANCE state. 2. When a running balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: make send work with concurrent block group relocationFilipe Manana
We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (specially metadata) or about to read a metadata extent and expecting it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with a different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495e0e0 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations. Relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystem we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it or send can end up delaying the block group relocation for too long. In the future we might also get the automatic block group relocation for non zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way: 1) For all tree searches, send acquires a read lock on the commit root semaphore; 2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path); 3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock; 4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers. The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove of update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send; 5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits. This leaves some room for optimizations for send to have less path releases and re searching the trees when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07Merge branch 'mptcp-fixes'David S. Miller
Mat Martineau says: ==================== mptcp: Fixes for buffer reclaim and option writing Here are three fixes dealing with a syzkaller crash MPTCP triggers in the memory manager in 5.16-rc8, and some option writing problems. Patches 1 and 2 fix some corner cases in MPTCP option writing. Patch 3 addresses a crash that syzkaller found a way to trigger in the mm subsystem by passing an invalid value to __sk_mem_reduce_allocated(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: Check reclaim amount before reducing allocationMat Martineau
syzbot found a page counter underflow that was triggered by MPTCP's reclaim code: page_counter underflow: -4294964789 nr_pages=4294967295 WARNING: CPU: 2 PID: 3785 at mm/page_counter.c:56 page_counter_cancel+0xcf/0xe0 mm/page_counter.c:56 Modules linked in: CPU: 2 PID: 3785 Comm: kworker/2:6 Not tainted 5.16.0-rc1-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:page_counter_cancel+0xcf/0xe0 mm/page_counter.c:56 Code: c7 04 24 00 00 00 00 45 31 f6 eb 97 e8 2a 2b b5 ff 4c 89 ea 48 89 ee 48 c7 c7 00 9e b8 89 c6 05 a0 c1 ba 0b 01 e8 95 e4 4b 07 <0f> 0b eb a8 4c 89 e7 e8 25 5a fb ff eb c7 0f 1f 00 41 56 41 55 49 RSP: 0018:ffffc90002d4f918 EFLAGS: 00010082 RAX: 0000000000000000 RBX: ffff88806a494120 RCX: 0000000000000000 RDX: ffff8880688c41c0 RSI: ffffffff815e8f28 RDI: fffff520005a9f15 RBP: ffffffff000009cb R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815e2cfe R11: 0000000000000000 R12: ffff88806a494120 R13: 00000000ffffffff R14: 0000000000000000 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2de21000 CR3: 000000005ad59000 CR4: 0000000000150ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> page_counter_uncharge+0x2e/0x60 mm/page_counter.c:160 drain_stock+0xc1/0x180 mm/memcontrol.c:2219 refill_stock+0x139/0x2f0 mm/memcontrol.c:2271 __sk_mem_reduce_allocated+0x24d/0x550 net/core/sock.c:2945 __mptcp_rmem_reclaim net/mptcp/protocol.c:167 [inline] __mptcp_mem_reclaim_partial+0x124/0x410 net/mptcp/protocol.c:975 mptcp_mem_reclaim_partial net/mptcp/protocol.c:982 [inline] mptcp_alloc_tx_skb net/mptcp/protocol.c:1212 [inline] mptcp_sendmsg_frag+0x18c6/0x2190 net/mptcp/protocol.c:1279 __mptcp_push_pending+0x232/0x720 net/mptcp/protocol.c:1545 mptcp_release_cb+0xfe/0x200 net/mptcp/protocol.c:2975 release_sock+0xb4/0x1b0 net/core/sock.c:3306 mptcp_worker+0x51e/0xc10 net/mptcp/protocol.c:2443 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> __mptcp_mem_reclaim_partial() could call __mptcp_rmem_reclaim() with a negative value, which passed that negative value to __sk_mem_reduce_allocated() and triggered the splat above. Check for a reclaim amount that is positive and large enough for __mptcp_rmem_reclaim() to actually adjust rmem_fwd_alloc (much like the sk_mem_reclaim_partial() code the function is based on). v2: Use '>' instead of '>=', since SK_MEM_QUANTUM - 1 would get right-shifted into nothing by __mptcp_rmem_reclaim. Fixes: 6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/252 Reported-and-tested-by: syzbot+bc9e2d2dbcb347dd215a@syzkaller.appspotmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix a DSS option writing errorGeliang Tang
'ptr += 1;' was omitted in the original code. If the DSS is the last option -- which is what we have most of the time -- that's not an issue. But it is if we need to send something else after like a RM_ADDR or an MP_PRIO. Fixes: 1bff1e43a30e ("mptcp: optimize out option generation") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix opt size when sending DSS + MP_FAILMatthieu Baerts
When these two options had to be sent -- which is not common -- the DSS size was not being taken into account in the remaining size. Additionally in this situation, the reported size was only the one of the MP_FAIL which can cause issue if at the end, we need to write more in the TCP options than previously said. Here we use a dedicated variable for MP_FAIL size to keep the WARN_ON_ONCE() just after. Fixes: c25aeb4e0953 ("mptcp: MP_FAIL suboption sending") Acked-and-tested-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07Merge branch 'mptcp-next'David S. Miller
Mat Martineau says: ==================== mptcp: New features and cleanup These patches have been tested in the MPTCP tree for a longer than usual time (thanks to holiday schedules), and are ready for the net-next branch. Changes include feature updates, small fixes, refactoring, and some selftest changes. Patch 1 fixes an OUTQ ioctl issue with TCP fallback sockets. Patches 2, 3, and 6 add support of the MPTCP fastclose option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP), and a related self test. Patch 4 cleans up some accept and poll code that is no longer needed after the fastclose changes. Patch 5 add userspace disconnect using AF_UNSPEC, which is used when testing fastclose and makes the MPTCP socket's handling of AF_UNSPEC in connect() more TCP-like. Patches 7-11 refactor subflow creation to make better use of multiple local endpoints and to better handle individual connection failures when creating multiple subflows. Includes self test updates. Patch 12 cleans up the way subflows are added to the MPTCP connection list, eliminating the need for calls throughout the MPTCP code that had to check the intermediate "join list" for entries to shift over to the main "connection list". Patch 13 refactors the MPTCP release_cb flags to use separate storage for values only accessed with the socket lock held (no atomic ops needed), and for values that need atomic operations. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: avoid atomic bit manipulation when possiblePaolo Abeni
Currently the msk->flags bitmask carries both state for the mptcp_release_cb() - mostly touched under the mptcp data lock - and others state info touched even outside such lock scope. As a consequence, msk->flags is always manipulated with atomic operations. This change splits such bitmask in two separate fields, so that we use plain bit operations when touching the cb-related info. The MPTCP_PUSH_PENDING bit needs additional care, as it is the only CB related field currently accessed either under the mptcp data lock or the mptcp socket lock. Let's add another mask just for such bit's sake. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup MPJ subflow list handlingPaolo Abeni
We can simplify the join list handling leveraging the mptcp_release_cb(): if we can acquire the msk socket lock at mptcp_finish_join time, move the new subflow directly into the conn_list, otherwise place it on join_list and let the release_cb process such list. Since pending MPJ connection are now always processed in a timely way, we can avoid flushing the join list every time we have to process all the current subflows. Additionally we can now use the mptcp data lock to protect the join_list, removing the additional spin lock. Finally, the MPJ handshake is now always finalized under the msk socket lock, we can drop the additional synchronization between mptcp_finish_join() and mptcp_close(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07selftests: mptcp: add tests for subflow creation failurePaolo Abeni
Verify that, when multiple endpoints are available, subflows creation proceed even when the first additional subflow creation fails - due to packet drop on the relevant link Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: do not block subflows creation on errorsPaolo Abeni
If the MPTCP configuration allows for multiple subflows creation, and the first additional subflows never reach the fully established status - e.g. due to packets drop or reset - the in kernel path manager do not move to the next subflow. This patch introduces a new PM helper to cope with MPJ subflow creation failure and delay and hook it where appropriate. Such helper triggers additional subflow creation, as needed and updates the PM subflow counter, if the current one is closing. Additionally start all the needed additional subflows as soon as the MPTCP socket is fully established, so we don't have to cope with slow MPJ handshake blocking the next subflow creation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep track of local endpoint still available for each mskPaolo Abeni
Include into the path manager status a bitmap tracking the list of local endpoints still available - not yet used - for the relevant mptcp socket. Keep such map updated at endpoint creation/deletion time, so that we can easily skip already used endpoint at local address selection time. The endpoint used by the initial subflow is lazyly accounted at subflow creation time: the usage bitmap is be up2date before endpoint selection and we avoid such unneeded task in some relevant scenarios - e.g. busy servers accepting incoming subflows but not creating any additional ones nor annuncing additional addresses. Overall this allows for fair local endpoints usage in case of subflow failure. As a side effect, this patch also enforces that each endpoint is used at most once for each mptcp connection. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: clean-up MPJ option writingPaolo Abeni
Check for all MPJ variant at once, this reduces the number of conditionals traversed on average and will simplify the next patch. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix per socket endpoint accountingPaolo Abeni
Since full-mesh endpoint support, the reception of a single ADD_ADDR option can cause multiple subflows creation. When such option is accepted we increment 'add_addr_accepted' by one. When we received a paired RM_ADDR option, we deleted all the relevant subflows, decrementing 'add_addr_accepted' by one for each of them. We have a similar issue for 'local_addr_used' Fix them moving the pm endpoint accounting outside the subflow traversal. Fixes: 1a0d6136c5f0 ("mptcp: local addresses fullmesh") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>