summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-03-04btrfs: preallocate temporary extent buffer for inode logging when neededFilipe Manana
When logging an inode and we require to copy items from subvolume leaves to the log tree, we clone each subvolume leaf and than use that clone to copy items to the log tree. This is required to avoid possible deadlocks as stated in commit 796787c978ef ("btrfs: do not modify log tree while holding a leaf from fs tree locked"). The cloning requires allocating an extent buffer (struct extent_buffer) and then allocating pages (folios) to attach to the extent buffer. This may be slow in case we are under memory pressure, and since we are doing the cloning while holding a read lock on a subvolume leaf, it means we can be blocking other operations on that leaf for significant periods of time, which can increase latency on operations like creating other files, renaming files, etc. Similarly because we're under a log transaction, we may also cause extra delay on other tasks doing an fsync, because syncing the log requires waiting for tasks that joined a log transaction to exit the transaction. So to improve this, for any inode logging operation that needs to copy items from a subvolume leaf ("full sync" or "copy everything" bit set in the inode), preallocate a dummy extent buffer before locking any extent buffer from the subvolume tree, and even before joining a log transaction, add it to the log context and then use it when we need to copy items from a subvolume leaf to the log tree. This avoids making other operations get extra latency when waiting to lock a subvolume leaf that is used during inode logging and we are under heavy memory pressure. The following test script with bonnie++ was used to test this: $ cat test.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh MOUNT_OPTIONS="-o ssd" MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'` NR_DIRECTORIES=20 NR_FILES=20480 DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576)) DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES)) NR_FILES=$((NR_FILES / 1024)) echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT bonnie++ -u root -d $MNT \ -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \ -r 0 -s $DATASET_SIZE -b umount $MNT The results of this test on a 8G VM running a non-debug kernel (Debian's default kernel config), were the following. Before this change: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 7501M 376k 99 1.4g 96 117m 14 1510k 99 2.5g 95 +++++ +++ Latency 35068us 24976us 2944ms 30725us 71770us 26152us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20:384100:384100/20 20480 32 20480 58 20480 48 20480 39 20480 56 20480 61 Latency 411ms 11914us 119ms 617ms 10296us 110ms After this change: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 7501M 375k 99 1.4g 97 117m 14 1546k 99 2.3g 98 +++++ +++ Latency 35975us 20945us 2144ms 10297us 2217us 6004us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20:384100:384100/20 20480 35 20480 58 20480 48 20480 40 20480 57 20480 59 Latency 320ms 11237us 77779us 518ms 6470us 86389us Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add comment about list_is_singular() use at btrfs_delete_unused_bgs()Filipe Manana
At btrfs_delete_unused_bgs(), the use of the list_is_singular() check on a block group may not be immediately obvious. It is there to prevent losing raid profile information for a block group type (data, metadata or system), as that information is removed from fs_info->avail_[data|metadata|system]_alloc_bits when the last block group of a given type is deleted. So deleting the block group would later result in creating block groups of that type with a single profile (because fs_info->avail_*_alloc_bits would have a value of 0). This check was added in commit aefbe9a633b5 ("btrfs: Fix lost-data-profile caused by auto removing bg"). So add a comment mentioning the need for the check. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: document what the spinlock unused_bgs_lock protectsFilipe Manana
Add some comments to struct btrfs_fs_info to explicitly document which members are protected by the spinlock unused_bgs_lock. It is currently used to protect two linked lists, the reclaim_bgs and unused_bgs lists. So add an explicit comment on top of each list to mention its protected by unused_bgs_lock, as well as comment on top of unused_bgs_lock to mention the lists it protects. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: make btrfs_error_unpin_extent_range() return voidDavid Sterba
This helper is used in transaction abort or cleanup context and the callers cannot handle all errors, only do best effort. btrfs_cleanup_one_transaction btrfs_destroy_delayed_refs btrfs_error_unpin_extent_range btrfs_destroy_pinned_extent btrfs_error_unpin_extent_range Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: return errors from unpin_extent_range()David Sterba
Handle the lookup failure of the block group to unpin, this is a logic error as the block group must exist at this point. If not, something else must have freed it, like clean_pinned_extents() would do without locking the unused_bg_unpin_mutex. Push the errors to the callers, proper handling will be done in followup patches. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle errors returned from unpin_extent_cache()David Sterba
We've had numerous attempts to let function unpin_extent_cache() return void as it only returns 0. There are still error cases to handle so do that, in addition to the verbose messages. The only caller btrfs_finish_one_ordered() will now abort the transaction, previously it let it continue which could lead to further problems. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: zlib: Fix spelling mistake "infalte" -> "inflate"Colin Ian King
There is a spelling mistake in a warning message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: zstd: fix and simplify the inline extent decompression (v2)Qu Wenruo
Note: this is a fixed version that was previously reverted as e01a83e12604 ("Revert "btrfs: zstd: fix and simplify the inline extent decompression""), with fixed parameters to memzero_page(). [BUG] If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this: item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160 generation 8 transid 8 size 4096 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24 index 2 namelen 14 name: source_inlined item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69 generation 8 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 3 (zstd) Then trying to reflink that extent in an aarch64 system with 64K page size, the reflink would just fail: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest XFS_IOC_CLONE_RANGE: Input/output error [CAUSE] In zstd_decompress(), we didn't treat @start_byte as just a page offset, but also use it as an indicator on whether we should error out, without any proper explanation (this is copied from other decompression code). In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffer nor error out, since the whole input/output buffer should never exceed one sector, thus we should not need to do any buffer switch. Thus the current code using @start_byte as a condition to switch input/output buffer or finish the decompression is completely incorrect. [FIX] The fix involves several modification: - Rename @start_byte to @dest_pgoff to properly express its meaning - Use @sectorsize other than PAGE_SIZE to properly initialize the output buffer size - Use correct destination offset inside the destination page - Simplify the main loop Since the input/output buffer should never switch, we only need one zstd_decompress_stream() call. - Consider early end as an error After the fix, even on 64K page sized aarch64, above reflink now works as expected: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest linked 4096/4096 bytes at offset 61440 And results the correct file layout: item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160 generation 10 transid 10 size 65536 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14 index 3 namelen 4 name: dest item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 10 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53 generation 10 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused included headersDavid Sterba
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: replace i_blocksize by fs_info::sectorsizeDavid Sterba
The block size calculated by i_blocksize from inode is the same as what we have in fs_info, initalized in inode_init_always(). Unify that to use the fs_info value everywhere. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: replace sb::s_blocksize by fs_info::sectorsizeDavid Sterba
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove duplicate recording of physical addressJohannes Thumshirn
Remove the duplicate physical recording of the original write physical address in case of a single device write. This duplicated code is most likely present due to a rebase error. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: page to folio conversion in btrfs_truncate_block()Goldwyn Rodrigues
Convert use of struct page to struct folio inside btrfs_truncate_block(). The only page based function is set_page_extent_mapped(). All other functions have folio equivalents. Had to use __filemap_get_folio() because filemap_grab_folio() does not allow passing allocation mask as a parameter. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use a folio array throughout the defrag processMatthew Wilcox (Oracle)
Remove more hidden calls to compound_head() by using an array of folios instead of pages. Also neaten the error path in defrag_one_range() by adjusting the length of the array instead of checking for NULL. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: convert defrag_prepare_one_page() to use a folioMatthew Wilcox (Oracle)
Use a folio throughout defrag_prepare_one_page() to remove dozens of hidden calls to compound_head(). There is no support here for large folios; indeed, turn the existing check for PageCompound into a check for large folios. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add set_folio_extent_mapped() helperMatthew Wilcox (Oracle)
Turn set_page_extent_mapped() into a wrapper around this version. Saves a call to compound_head() for callers who already have a folio and removes a couple of users of page->mapping. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: WARN_ON_ONCE() in our leak detection codeJosef Bacik
fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak detection code (enabled only in debug builds) so that fstests will fail if these things trip at all. This will allow us to easily catch problems with our reference counting that may otherwise go unnoticed. Reviewed-by: Neal Gompa <neal@gompa.dev> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove extent_map_tree forward declaration at extent_io.hFilipe Manana
There's no need to do a forward declaration of struct extent_map_tree at extent_io.h, as there are no function prototypes, inline functions or data structures that refer to struct extent_map_tree. So remove that forward declaration, which is not needed since commit 477a30ba5f8d ("btrfs: Sink extent_tree arguments in try_release_extent_mapping"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: cache folio size and shift in extent_bufferQu Wenruo
After the conversion to folio interfaces (but without the patch to enable larger folio allocation), there is an LTP report about observable performance drop on metadata heavy operations. https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/ This drop is caused by the extra code of calculating the folio_size()/folio_shift(), instead of the old hard coded PAGE_SIZE/PAGE_SHIFT. To slightly reduce the overhead, just cache both folio_size and folio_shift in extent_buffer. The two new members (u32 folio_size and u8 folio_shift) are stored inside the holes of extent_buffer. folio_size is shared with len, which is reduced to u32. The size of eb does not change. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused variable bio_offset from end_bbio_data_read()Qu Wenruo
The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs: pass bio_offset to check_data_csum() directly"), when we are still using the same endio function for both data and metadata. Later we had several changes to data and metadata endio functions: - Data verification is handled by btrfs bio layer - Split data and metadata endio paths Now for data path we no longer do any verification in end_bbio_data_read(), as the verification is handled by btrfs bio layer already. Thus there is no need for such bio_offset variable. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove the pg_offset parameter from btrfs_get_extent()Qu Wenruo
The parameter @pg_offset of btrfs_get_extent() is only utilized for inlined extent, and we already have an ASSERT() and tree-checker, to make sure we can only get inline extent at file offset 0. Any invalid inline extent with non-zero file offset would be rejected by tree-checker in the first place. Thus the @pg_offset parameter is not really necessary, just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04hugetlbfs: support idmapped mountsGiuseppe Scrivano
pass down the idmapped mount information to the different helper functions. Differently, hugetlb_file_setup() will continue to not have any mapping since it is only used from contexts where idmapped mounts are not used. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Link: https://lore.kernel.org/r/20240229152405.105031-1-gscrivan@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-02Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-02-29 We've added 119 non-merge commits during the last 32 day(s) which contain a total of 150 files changed, 3589 insertions(+), 995 deletions(-). The main changes are: 1) Extend the BPF verifier to enable static subprog calls in spin lock critical sections, from Kumar Kartikeya Dwivedi. 2) Fix confusing and incorrect inference of PTR_TO_CTX argument type in BPF global subprogs, from Andrii Nakryiko. 3) Larger batch of riscv BPF JIT improvements and enabling inlining of the bpf_kptr_xchg() for RV64, from Pu Lehui. 4) Allow skeleton users to change the values of the fields in struct_ops maps at runtime, from Kui-Feng Lee. 5) Extend the verifier's capabilities of tracking scalars when they are spilled to stack, especially when the spill or fill is narrowing, from Maxim Mikityanskiy & Eduard Zingerman. 6) Various BPF selftest improvements to fix errors under gcc BPF backend, from Jose E. Marchesi. 7) Avoid module loading failure when the module trying to register a struct_ops has its BTF section stripped, from Geliang Tang. 8) Annotate all kfuncs in .BTF_ids section which eventually allows for automatic kfunc prototype generation from bpftool, from Daniel Xu. 9) Several updates to the instruction-set.rst IETF standardization document, from Dave Thaler. 10) Shrink the size of struct bpf_map resp. bpf_array, from Alexei Starovoitov. 11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer, from Benjamin Tissoires. 12) Fix bpftool to be more portable to musl libc by using POSIX's basename(), from Arnaldo Carvalho de Melo. 13) Add libbpf support to gcc in CORE macro definitions, from Cupertino Miranda. 14) Remove a duplicate type check in perf_event_bpf_event, from Florian Lehner. 15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them with notrace correctly, from Yonghong Song. 16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible array to fix build warnings, from Kees Cook. 17) Fix resolve_btfids cross-compilation to non host-native endianness, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) selftests/bpf: Test if shadow types work correctly. bpftool: Add an example for struct_ops map and shadow type. bpftool: Generated shadow variables for struct_ops maps. libbpf: Convert st_ops->data to shadow type. libbpf: Set btf_value_type_id of struct bpf_map for struct_ops. bpf: Replace bpf_lpm_trie_key 0-length array with flexible array bpf, arm64: use bpf_prog_pack for memory management arm64: patching: implement text_poke API bpf, arm64: support exceptions arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT bpf: add is_async_callback_calling_insn() helper bpf: introduce in_sleepable() helper bpf: allow more maps in sleepable bpf programs selftests/bpf: Test case for lacking CFI stub functions. bpf: Check cfi_stubs before registering a struct_ops type. bpf: Clarify batch lookup/lookup_and_delete semantics bpf, docs: specify which BPF_ABS and BPF_IND fields were zero bpf, docs: Fix typos in instruction-set.rst selftests/bpf: update tcp_custom_syncookie to use scalar packet offset bpf: Shrink size of struct bpf_map/bpf_array. ... ==================== Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-02Merge tag 'xfs-6.8-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Chandan Babu: "Drop experimental warning message when mounting an xfs filesystem on an fsdax device. We now consider xfs on fsdax to be stable" * tag 'xfs-6.8-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: drop experimental warning for FSDAX
2024-03-01libfs: improve path_from_stashed()Christian Brauner
Right now we pass a bunch of info that is fs specific which doesn't make a lot of sense and it bleeds fs sepcific details into the generic helper. nsfs and pidfs have slightly different needs when initializing inodes. Add simple operations that are stashed in sb->s_fs_info that both can implement. This also allows us to get rid of cleaning up references in the caller. All in all path_from_stashed() becomes way simpler. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01Merge tag 'ceph-for-6.8-rc7' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fix from Ilya Dryomov: "Catch up with mdsmap encoding rectification which ended up being necessary after all to enable cluster upgrades from problematic v18.2.0 and v18.2.1 releases" * tag 'ceph-for-6.8-rc7' of https://github.com/ceph/ceph-client: ceph: switch to corrected encoding of max_xattr_size in mdsmap
2024-03-01Merge tag 'for-6.8-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix freeing allocated id for anon dev when snapshot creation fails - fiemap fixes: - followup for a recent deadlock fix, ranges that fiemap can access can still race with ordered extent completion - make sure fiemap with SYNC flag does not race with writes * tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix double free of anonymous device after snapshot creation failure btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given btrfs: fix race between ordered extent completion and fiemap
2024-03-01Merge tag 'exfat-for-6.8-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: - Fix ftruncate failure when allocating non-contiguous clusters * tag 'exfat-for-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix appending discontinuous clusters to empty file
2024-03-01Merge tag 'vfs-6.8-rc7.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "Two small fixes: - Fix an endless loop during afs directory iteration caused by not skipping silly-rename files correctly. - Fix reporting of completion events for aio causing leaks in userspace. This is based on the fix last week as it's now possible to recognize aio events submitted through the old aio interface" * tag 'vfs-6.8-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/aio: Make io_cancel() generate completions again afs: Fix endless loop in directory parsing
2024-03-01Merge tag 'efi-fixes-for-v6.8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "Only the EFI variable name size change is significant, and will be backported once it lands. The others are cleanup. - Fix phys_addr_t size confusion in 32-bit capsule loader - Reduce maximum EFI variable name size to 512 to work around buggy firmware - Drop some redundant code from efivarfs while at it" * tag 'efi-fixes-for-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: Drop 'duplicates' bool parameter on efivar_init() efivarfs: Drop redundant cleanup on fill_super() failure efivarfs: Request at most 512 bytes for variable names efi/capsule-loader: fix incorrect allocation size
2024-03-01nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()Trond Myklebust
The main point of the guarded SETATTR is to prevent races with other WRITE and SETATTR calls. That requires that the check of the guard time against the inode ctime be done after taking the inode lock. Furthermore, we need to take into account the 32-bit nature of timestamps in NFSv3, and the possibility that files may change at a faster rate than once a second. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Fix a regression in nfsd_setattr()Trond Myklebust
Commit bb4d53d66e4b ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations") broke the NFSv3 pre/post op attributes behaviour when doing a SETATTR rpc call by stripping out the calls to fh_fill_pre_attrs() and fh_fill_post_attrs(). Fixes: bb4d53d66e4b ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Message-ID: <20240216012451.22725-1-trondmy@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: OP_CB_RECALL_ANY should recall both read and write delegationsDai Ngo
Add RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: handle GETATTR conflict with write delegationDai Ngo
If the GETATTR request on a file that has write delegation in effect and the request attributes include the change info and size attribute then the request is handled as below: Server sends CB_GETATTR to client to get the latest change info and file size. If these values are the same as the server's cached values then the GETATTR proceeds as normal. If either the change info or file size is different from the server's cached values, or the file was already marked as modified, then: . update time_modify and time_metadata into file's metadata with current time . encode GETATTR as normal except the file size is encoded with the value returned from CB_GETATTR . mark the file as modified If the CB_GETATTR fails for any reasons, the delegation is recalled and NFS4ERR_DELAY is returned for the GETATTR. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: add support for CB_GETATTR callbackDai Ngo
Includes: . CB_GETATTR proc for nfs4_cb_procedures[] . XDR encoding and decoding function for CB_GETATTR request/reply . add nfs4_cb_fattr to nfs4_delegation for sending CB_GETATTR and store file attributes from client's reply. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Document the phases of CREATE_SESSIONChuck Lever
As described in RFC 8881 Section 18.36.4, CREATE_SESSION can be split into four phases. NFSD's implementation now does it like that description. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Fix the NFSv4.1 CREATE_SESSION operationChuck Lever
RFC 8881 Section 18.36.4 discusses the implementation of the NFSv4.1 CREATE_SESSION operation. The section defines four phases of operation. Phase 2 processes the CREATE_SESSION sequence ID. As a separate step, Phase 3 evaluates the CREATE_SESSION arguments. The problem we are concerned with is when phase 2 is successful but phase 3 fails. The spec language in this case is "No changes are made to any client records on the server." RFC 8881 Section 18.35.4 defines a "client record", and it does /not/ contain any details related to the special CREATE_SESSION slot. Therefore NFSD is incorrect to skip incrementing the CREATE_SESSION sequence id when phase 3 (see Section 18.36.4) of CREATE_SESSION processing fails. In other words, even though NFSD happens to store the cs_slot in a client record, in terms of the protocol the slot is logically separate from the client record. Three complications: 1. The world has moved on since commit 86c3e16cc7aa ("nfsd4: confirm only on succesful create_session") broke this. So we can't simply revert that commit. 2. NFSD's CREATE_SESSION implementation does not cleanly delineate the logic of phases 2 and 3. So this won't be a surgical fix. 3. Because of the way it currently handles the CREATE_SESSION slot sequence number, nfsd4_create_session() isn't caching error responses in the CREATE_SESSION slot. Instead of replaying the response cache in those cases, it's executing the transaction again. Reorganize the CREATE_SESSION slot sequence number accounting. This requires that error responses are appropriately cached in the CREATE_SESSION slot (once it is found). Reported-by: Connor Smith <connor.smith@hitachivantara.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218382 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: clean up comments over nfs4_client definitionChen Hanxiao
nfsd fault injection has been deprecated since commit 9d60d93198c6 ("Deprecate nfsd fault injection") and removed by commit e56dc9e2949e ("nfsd: remove fault injection code") So remove the outdated parts about fault injection. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd4_init_slabsKunwu Chan
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Make the code cleaner and more readable. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd_drc_slab_createKunwu Chan
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. And change cache name from 'nfsd_drc' to 'nfsd_cacherep'. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd_file_cache_initKunwu Chan
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd4_init_pnfsKunwu Chan
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: don't call locks_release_private() twice concurrentlyNeilBrown
It is possible for free_blocked_lock() to be called twice concurrently, once from nfsd4_lock() and once from nfsd4_release_lockowner() calling remove_blocked_locks(). This is why a kref was added. It is perfectly safe for locks_delete_block() and kref_put() to be called in parallel as they use locking or atomicity respectively as protection. However locks_release_private() has no locking. It is safe for it to be called twice sequentially, but not concurrently. This patch moves that call from free_blocked_lock() where it could race with itself, to free_nbl() where it cannot. This will slightly delay the freeing of private info or release of the owner - but not by much. It is arguably more natural for this freeing to happen in free_nbl() where the structure itself is freed. This bug was found by code inspection - it has not been seen in practice. Fixes: 47446d74f170 ("nfsd4: add refcount for nfsd4_blocked_lock") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow layout state to be admin-revoked.NeilBrown
When there is layout state on a filesystem that is being "unlocked" that is now revoked, which involves closing the nfsd_file and releasing the vfs lease. To avoid races, ->ls_file can now be accessed either: - under ->fi_lock for the state's sc_file or - under rcu_read_lock() if nfsd_file_get() is used. To support this, ->fence_client and nfsd4_cb_layout_fail() now take a second argument being the nfsd_file. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow delegation state ids to be revoked and then freedNeilBrown
Revoking state through 'unlock_filesystem' now revokes any delegation states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. As there is already support for revoking delegations, we build on that for admin-revoking. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow open state ids to be revoked and then freedNeilBrown
Revoking state through 'unlock_filesystem' now revokes any open states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Possibly the related lock states should be revoked too, but a subsequent patch will do that for all lock state on the superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow lock state ids to be revoked and then freedNeilBrown
Revoking state through 'unlock_filesystem' now revokes any lock states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow admin-revoked NFSv4.0 state to be freed.NeilBrown
For NFSv4.1 and later the client easily discovers if there is any admin-revoked state and will then find and explicitly free it. For NFSv4.0 there is no such mechanism. The client can only find that state is admin-revoked if it tries to use that state, and there is no way for it to explicitly free the state. So the server must hold on to the stateid (at least) for an indefinite amount of time. A RELEASE_LOCKOWNER request might justify forgetting some of these stateids, as would the whole clients lease lapsing, but these are not reliable. This patch takes two approaches. Whenever a client uses an revoked stateid, that stateid is then discarded and will not be recognised again. This might confuse a client which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at all, but should mostly work. Hopefully one error will lead to other resources being closed (e.g. process exits), which will result in more stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED. Also, any admin-revoked stateids that have been that way for more than one lease time are periodically revoke. No actual freeing of state happens in this patch. That will come in future patches which handle the different sorts of revoked state. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: report in /proc/fs/nfsd/clients/*/states when state is admin-revokeNeilBrown
Add "admin-revoked" to the status information for any states that have been admin-revoked. This can be useful for confirming correct behaviour. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow state with no file to appear in /proc/fs/nfsd/clients/*/statesNeilBrown
Change the "show" functions to show some content even if a file cannot be found. This is the case for admin-revoked state. This is primarily useful for debugging - to ensure states are being removed eventually. So change several seq_printf() to seq_puts(). Some of these are needed to keep checkpatch happy. Others were done for consistency. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>