summaryrefslogtreecommitdiff
path: root/fs/btrfs/ioctl.c
AgeCommit message (Collapse)Author
6 daysMerge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Optimization to avoid reference counts on non-cloned registered buffers. This is how these buffers were handled prior to having cloning support, and we can still use that approach as long as the buffers haven't been cloned to another ring. - Cleanup and improvement for uring_cmd, where btrfs was the only user of storing allocated data for the lifetime of the uring_cmd. Clean that up so we can get rid of the need to do that. - Avoid unnecessary memory copies in uring_cmd usage. This is particularly important as a lot of uring_cmd usage necessitates the use of 128b SQEs. - A few updates for recv multishot, where it's now possible to add fairness limits for limiting how much is transferred for each retry loop. Additionally, recv multishot now supports an overall cap as well, where once reached the multishot recv will terminate. The latter is useful for buffer management and juggling many recv streams at the same time. - Add support for returning the TX timestamps via a new socket command. This feature can work in either singleshot or multishot mode, where the latter triggers a completion whenever new timestamps are available. This is an alternative to using the existing error queue. - Add support for an io_uring "mock" file, which is the start of being able to do 100% targeted testing in terms of exercising io_uring request handling. The idea is to have a file type that can be anything the tester would like, and behave exactly how you want it to behave in terms of hitting the code paths you want. - Improve zcrx by using sgtables to de-duplicate and improve dma address handling. - Prep work for supporting larger pages for zcrx. - Various little improvements and fixes. * tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits) io_uring/zcrx: fix leaking pages on sg init fail io_uring/zcrx: don't leak pages on account failure io_uring/zcrx: fix null ifq on area destruction io_uring: fix breakage in EXPERT menu io_uring/cmd: remove struct io_uring_cmd_data btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag io_uring/zcrx: account area memory io_uring: export io_[un]account_mem io_uring/net: Support multishot receive len cap io_uring: deduplicate wakeup handling io_uring/net: cast min_not_zero() type io_uring/poll: cleanup apoll freeing io_uring/net: allow multishot receive per-invocation cap io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags io_uring/net: use passed in 'len' in io_recv_buf_select() io_uring/zcrx: prepare fallback for larger pages io_uring/zcrx: assert area type in io_zcrx_iov_page io_uring/zcrx: allocate sgtable for umem areas io_uring/zcrx: introduce io_populate_area_dma ...
6 daysMerge tag 'vfs-6.17-rc1.fileattr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fileattr updates from Christian Brauner: "This introduces the new file_getattr() and file_setattr() system calls after lengthy discussions. Both system calls serve as successors and extensible companions to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have started to show their age in addition to being named in a way that makes it easy to conflate them with extended attribute related operations. These syscalls allow userspace to set filesystem inode attributes on special files. One of the usage examples is the XFS quota projects. XFS has project quotas which could be attached to a directory. All new inodes in these directories inherit project ID set on parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. This is not critical but in the case when special files are created in the directory with already existing project quota, these new inodes inherit extended attributes. This creates a mix of special files with and without attributes. Moreover, special files with attributes don't have a possibility to become clear or change the attributes. This, in turn, prevents userspace from re-creating quota project on these existing files. In addition, these new system calls allow the implementation of additional attributes that we couldn't or didn't want to fit into the legacy ioctls anymore" * tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: tighten a sanity check in file_attr_to_fileattr() tree-wide: s/struct fileattr/struct file_kattr/g fs: introduce file_getattr and file_setattr syscalls fs: prepare for extending file_get/setattr() fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP selinux: implement inode_file_[g|s]etattr hooks lsm: introduce new hooks for setting/getting inode fsxattr fs: split fileattr related helpers into separate file
13 daysbtrfs: defrag: add flag to force no-compressionDavid Sterba
Currently the defrag ioctl cannot rewrite the extents without compression. Add a new flag for that, as setting compression to 0 (or "no compression") means to do no changes to compression so take what is the current default, like mount options or properties. The defrag setting overrides mount or properties. The compression BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and does not need to have a fixed value. Mount with zstd:9, copy test file from /usr/bin/ (about 260KB): $ mount -o compress=zstd:9 /dev/vda /mnt $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 13312.. 13439: 128: encoded 1: 128.. 255: 13364.. 13491: 128: 13440: encoded 2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof testfile: 3 extents found $ compsize testfile Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 42% 124K 292K 292K zstd 42% 124K 292K 292K Defrag to uncompressed: $ btrfs fi defrag --nocomp testfile $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 291: 291840.. 292131: 292: last,eof testfile: 1 extent found $ compsize testfile Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 292K 292K 292K none 100% 292K 292K 292K Compress again with LZO: $ btrfs fi defrag -clzo testfile $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 13312.. 13439: 128: encoded 1: 128.. 255: 13392.. 13519: 128: 13440: encoded 2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof testfile: 3 extents found $ compsize testfile Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 64% 188K 292K 292K lzo 64% 188K 292K 292K Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: set search_commit_root to false in iterate_inodes_from_logical()Filipe Manana
There's no point in checking at iterate_inodes_from_logical() if the path has search_commit_root set, the only caller never sets search_commit_root to true and it doesn't make sense for it ever to be true for the current use case (logical_to_ino ioctl). So stop checking for that and since the only caller allocates the path just for it to be used by iterate_inodes_from_logical(), move the path allocation into that function. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: qgroup: use btrfs_qgroup_enabled() in ioctlsFilipe Manana
We have a publicly exported btrfs_qgroup_enabled() and an ioctl.c private qgroup_enabled() helper. Both of these test if qgroups are enabled, the first check if the flag BTRFS_FS_QUOTA_ENABLED is set in fs_info->flags while the second checks if fs_info->quota_root is not NULL while holding the mutex fs_info->qgroup_ioctl_lock. We can get away with the private ioctl.c:qgroup_enabled(), as all entry points into the qgroup code check if fs_info->quota_root is NULL or not while holding the mutex fs_info->qgroup_ioctl_lock, and returning the error -ENOTCONN in case it's NULL. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: call bdev_fput() to reclaim the blk_holder immediatelyQu Wenruo
As part of the preparation for btrfs blk_holder_ops, we want to ensure the holder of a block device has a proper lifespan. However btrfs is always using fput() to close a block device, which has one problem: - fput() is deferred Meaning we can have a block device with invalid (aka, freed) holder. To avoid the problem and align the behavior to other code, just call bdev_fput() instead. There is some extra requirement on the locking, but that's all resolved by previous patches and we should be safe to call bdev_fput(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: don't skip accounting in early ENOTTY return in ↵Caleb Sander Mateos
btrfs_uring_encoded_read() btrfs_uring_encoded_read() returns early with -ENOTTY if the uring_cmd is issued with IO_URING_F_COMPAT but the kernel doesn't support compat syscalls. However, this early return bypasses the syscall accounting. Go to out_acct instead to ensure the syscall is counted. Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: pass bool to indicate subvolume/snapshot creation typeDavid Sterba
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: pass dentry to btrfs_mksubvol() and btrfs_mksnapshot()David Sterba
There's no reason to pass 'struct path' to btrfs_mksubvol(), though it's been like the since the first commit 76dda93c6ae2c1 ("Btrfs: add snapshot/subvolume destroy ioctl"). We only use the dentry so we should pass it directly. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: use struct qstr for subvolume ioctl helpersDavid Sterba
We pass name and length of subvolumes separately to the related functions, while this can be a struct qstr which is otherwise used for dentry interfaces. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: replace strcpy() with strscpy()Brahmajit Das
strcpy() is discouraged from use due to lack of bounds checking. Replaces it with strscpy(), the recommended alternative for null terminated strings, to follow best practices. There are instances where strscpy() cannot be used such as where both the source and destination are character pointers. In that instance we can use sysfs_emit(). Link: https://github.com/KSPP/linux/issues/88 Suggested-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Brahmajit Das <bdas@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: use pgoff_t for page index variablesDavid Sterba
Any conversion of offsets in the logical or the physical mapping space of the pages is done by a shift and the target type should be pgoff_t (type of struct page::index). Fix the locations where it's still unsigned long. Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: add btrfs prefix to is_fstree() and make it return boolFilipe Manana
This is an exported function and therefore it should have a 'btrfs_' prefix, to make it clear it's btrfs specific, avoid future name collisions with code outside btrfs, and make its naming consistent with most other btrfs exported functions. So add a 'btrfs_' prefix to it and make it return bool instead of int, since all we need is to return true or false. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: rename error to ret in btrfs_mksubvol()David Sterba
Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: rename error to ret in btrfs_may_delete()David Sterba
Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com>yy Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: switch RCU helper versions to btrfs_info()David Sterba
The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-18btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmdCaleb Sander Mateos
btrfs is the only user of struct io_uring_cmd_data and its op_data field. Switch its ->uring_cmd() implementations to store the struct btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlayed with io_uring_cmd's pdu field. This avoids having to touch another cache line to access the struct btrfs_uring_encoded_data *, and allows op_data and struct io_uring_cmd_data to be removed. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04tree-wide: s/struct fileattr/struct file_kattr/gChristian Brauner
Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-03Merge tag 'for-6.16-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-log fixes: - fixes of log tracking of directories and subvolumes - fix iteration and error handling of inode references during log replay - fix free space tree rebuild (reported by syzbot) * tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use btrfs_record_snapshot_destroy() during rmdir btrfs: propagate last_unlink_trans earlier when doing a rmdir btrfs: record new subvolume in parent dir earlier to avoid dir logging races btrfs: fix inode lookup error handling during log replay btrfs: fix iteration of extrefs during log replay btrfs: fix missing error handling when searching for inode refs during log replay btrfs: fix failure to rebuild free space tree using multiple transactions
2025-06-27btrfs: record new subvolume in parent dir earlier to avoid dir logging racesFilipe Manana
Instead of recording that a new subvolume was created in a directory after we add the entry do the directory, record it before adding the entry. This is to avoid races where after creating the entry and before recording the new subvolume in the directory (the call to btrfs_record_new_subvolume()), another task logs the directory, so we end up with a log tree where we logged a directory that has an entry pointing to a root that was not yet committed, resulting in an invalid entry if the log is persisted and replayed later due to a power failure or crash. Also state this requirement in the function comment for btrfs_record_new_subvolume(), similar to what we do for the btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy(). Fixes: 45c4102f0d82 ("btrfs: avoid transaction commit on any fsync after subvolume creation") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-23Merge tag 'for-6.16-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-19btrfs: scrub: add prefix for the error messagesAnand Jain
Add a "scrub: " prefix to all messages logged by scrub so that it's easy to filter them from dmesg for analysis. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-26Merge tag 'for-6.16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Apart from numerous cleanups, there are some performance improvements and one minor mount option update. There's one more radix-tree conversion (one remaining), and continued work towards enabling large folios (almost finished). Performance: - extent buffer conversion to xarray gains throughput and runtime improvements on metadata heavy operations doing writeback (sample test shows +50% throughput, -33% runtime) - extent io tree cleanups lead to performance improvements by avoiding unnecessary searches or repeated searches - more efficient extent unpinning when committing transaction (estimated run time improvement 3-5%) User visible changes: - remove standalone mount option 'nologreplay', deprecated in 5.9, replacement is 'rescue=nologreplay' - in scrub, update reporting, add back device stats message after detected errors (accidentally removed during recent refactoring) Core: - convert extent buffer radix tree to xarray - in subpage mode, move block perfect compression out of experimental build - in zoned mode, introduce sub block groups to allow managing special block groups, like the one for relocation or tree-log, to handle some corner cases of ENOSPC - in scrub, simplify bitmaps for block tracking status - continued preparations for large folios: - remove assertions for folio order 0 - add support where missing: compression, buffered write, defrag, hole punching, subpage, send - fix fsync of files with no hard links not persisting deletion - reject tree blocks which are not nodesize aligned, a precaution from 4.9 times - move transaction abort calls closer to the error sites - remove usage of some struct bio_vec internals - simplifications in extent map - extent IO cleanups and optimizations - error handling improvements - enhanced ASSERT() macro with optional format strings - cleanups: - remove unused code - naming unifications, dropped __, added prefix - merge similar functions - use common helpers for various data structures" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits) btrfs: move misplaced comment of btrfs_path::keep_locks btrfs: remove standalone "nologreplay" mount option btrfs: use a single variable to track return value at btrfs_page_mkwrite() btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write btrfs: simplify early error checking in btrfs_page_mkwrite() btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite() btrfs: fix wrong start offset for delalloc space release during mmap write btrfs: fix harmless race getting delayed ref head count when running delayed refs btrfs: log error codes during failures when writing super blocks btrfs: simplify error return logic when getting folio at prepare_one_folio() btrfs: return real error from __filemap_get_folio() calls btrfs: remove superfluous return value check at btrfs_dio_iomap_begin() btrfs: fix invalid data space release when truncating block in NOCOW mode btrfs: update Kconfig option descriptions btrfs: update list of features built under experimental config btrfs: send: remove btrfs_debug() calls btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() btrfs: fold error checks when allocating ordered extent and update comments btrfs: check we grabbed inode reference when allocating an ordered extent ...
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-15btrfs: trivial conversion to return bool instead of intDavid Sterba
Old code has a lot of int for bool return values, bool is recommended and done in new code. Convert the trivial cases that do simple 0/false and 1/true. Functions comment are updated if needed. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: add btrfs prefix to main lock, try lock and unlock extent functionsFilipe Manana
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-07VFS: improve interface for lookup_one functionsNeilBrown
The family of functions: lookup_one() lookup_one_unlocked() lookup_one_positive_unlocked() appear designed to be used by external clients of the filesystem rather than by filesystems acting on themselves as the lookup_one_len family are used. They are used by: btrfs/ioctl - which is a user-space interface rather than an internal activity exportfs - i.e. from nfsd or the open_by_handle_at interface overlayfs - at access the underlying filesystems smb/server - for file service They should be used by nfsd (more than just the exportfs path) and cachefs but aren't. It would help if the documentation didn't claim they should "not be called by generic code". Also the path component name is passed as "name" and "len" which are (confusingly?) separate by the "base". In some cases the len in simply "strlen" and so passing a qstr using QSTR() would make the calling clearer. Other callers do pass separate name and len which are stored in a struct. Sometimes these are already stored in a qstr, other times it easily could be. So this patch changes these three functions to receive a 'struct qstr *', and improves the documentation. QSTR_LEN() is added to make it easy to pass a QSTR containing a known len. [brauner@kernel.org: take a struct qstr pointer] Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-01btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAINSidong Yang
Fix a bug in encoded read that mistakenly frees the iov in case btrfs_encoded_read() returns -EAGAIN assuming the structure will be reused. This can happen when when receiving requests concurrently, the io_uring subsystem does not reset the data, and the last free will happen in btrfs_uring_read_finished(). Handle the -EAGAIN error and skip freeing iov. CC: stable@vger.kernel.org # 6.13+ Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: simplify the return value handling in search_ioctl()Sun YangKai
Move the assignment of -EFAULT to within the error condition check in fault_in_subpage_writeable(). The previous placement outside the condition could lead to the error value being overwritten by subsequent assignments, cause unnecessary assignments. Simplify loop exit logic by removing redundant goto. The original code used 'goto err' to bypass post-loop processing after handling errors from btrfs_search_forward(). However, the loop's termination naturally falls through to the post-loop section, which already handles 'ret' values. Replacing 'goto err' with 'break' eliminates redundant control flow, consolidates error handling, and makes the loop's exit conditions explicit. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget() return a btrfs inode insteadFilipe Manana
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so change the return type to struct btrfs_inode instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: unify inode variable namingDavid Sterba
Rename binode to inode in local variables or parameters so it's more unified with the rest of the code. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct to btrfs_ioctl_subvol_getflags()David Sterba
Pass a struct btrfs_inode to btrfs_ioctl_subvol_getflags() as it's an internal interface, allowing to remove some use of BTRFS_I. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: simplify local variables in btrfs_ioctl_resize()David Sterba
Remove some redundant variables and assignments, move variable declarations to their closest scope. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()David Sterba
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's an internal interface. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass root pointers to search tree ioctl helpersDavid Sterba
The search tree ioctl use btrfs_root so change that from btrfs_inode pointers so we don't have to do the conversion. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass btrfs_root pointers to send ioctl parametersDavid Sterba
The ioctl switch btrfs_ioctl() provides several parameter types for convenience so we don't have to do the conversion in the callbacks. Pass root pointers to the send related functions. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: parameter constification in ioctl.cDavid Sterba
Add const to function parameters that are not changed. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct btrfs_inode to btrfs_defrag_file()David Sterba
Pass a struct btrfs_inode to btrfs_defrag_file() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: unify ordering of btrfs_key initializationsDavid Sterba
The btrfs_key is defined as objectid/type/offset and the keys are also printed like that. For better readability, update all key initializations to match this order. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: expose per-inode stable writes flagQu Wenruo
The address space flag AS_STABLE_WRITES determine if FGP_STABLE for will wait for the folio to finish its writeback. For btrfs, due to the default data checksum behavior, if we modify the folio while it's still under writeback, it will cause data checksum mismatch. Thus for quite some call sites we manually call folio_wait_writeback() to prevent such problem from happening. Currently there is only one call site inside btrfs really utilizing FGP_STABLE, and in that case we also manually call folio_wait_writeback() to do the waiting. But it's better to properly expose the stable writes flag to a per-inode basis, to allow call sites to fully benefit from FGP_STABLE flag. E.g. for inodes with NODATASUM allowing beginning dirtying the page without waiting for writeback. This involves: - Update the mapping's stable write flag when setting/clearing NODATASUM inode flag using ioctl This only works for empty files, so it should be fine. - Update the mapping's stable write flag when reading an inode from disk - Remove the explicit folio_wait_writeback() for FGP_BEGINWRITE call site Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-23Merge tag 'fsnotify_hsm_for_v6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ...
2025-01-13btrfs: add io_uring interface for encoded writesMark Harmstone
Add an io_uring interface for encoded writes, with the same parameters as the BTRFS_IOC_ENCODED_WRITE ioctl. As with the encoded reads code, there's a test program for this at https://github.com/maharmstone/io_uring-encoded, and I'll get this worked into an fstest. How io_uring works is that it initially calls btrfs_uring_cmd with the IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in a kthread with the flag cleared. Ideally we'd honour this and call try_lock etc., but there's still a lot of work to be done to create non-blocking versions of all the functions in our write path. Instead, just validate the input in btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a view to properly optimizing the optimistic path later on. Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: ioctl: remove unnecessary call to btrfs_mark_buffer_dirty()Filipe Manana
The call to btrfs_mark_buffer_dirty() at btrfs_ioctl_default_subvol() is not necessary as we have a path setup for writing with btrfs_search_slot() having a 'cow' argument set to 1. This just makes the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move btrfs_is_empty_uuid() from ioctl.c into fs.cFilipe Manana
It's a generic helper not specific to ioctls and used in several places, so move it out from ioctl.c and into fs.c. While at it change its return type from int to bool and declare the loop variable in the loop itself. This also slightly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 1959449 1de619 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781340 161037 16920 1959297 1de581 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move the exclusive operation functions into fs.cFilipe Manana
The declarations for the exclusive operation functions are located at fs.h but their definitions are in ioctl.c, which doesn't make much sense since (most of them) are used in several files other than ioctl.c. Since they are used in several files and they are generic enough, move them out of ioctl.c and into fs.c, even the ones that are currently only used at ioctl.c, for the sake of having them all in the same C file. This also reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 1959449 1de619 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: handle FS_IOC_READ_VERITY_METADATA ioctlAllison Karlitskaya
Commit 146054090b08 ("btrfs: initial fsverity support") introduced fs-verity support for btrfs, but didn't add support for FS_IOC_READ_VERITY_METADATA to directly query the Merkle tree, descriptor and signature blocks for fs-verity enabled files. Add the (trival) implementation: we just need to wire it through to the fs-verity code, the same way as is done in the other two filesystems which support this ioctl (ext4, f2fs). The fs-verity code already has access to the required data. This is also safe to backport to older stable trees (5.15+) if needed. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-09Merge tag 'for-6.13-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes. Besides the one-liners in Btrfs there's fix to the io_uring and encoded read integration (added in this development cycle). The update to io_uring provides more space for the ongoing command that is then used in Btrfs to handle some cases. - io_uring and encoded read: - provide stable storage for io_uring command data - make a copy of encoded read ioctl call, reuse that in case the call would block and will be called again - properly initialize zlib context for hardware compression on s390 - fix max extent size calculation on filesystems with non-zoned devices - fix crash in scrub on crafted image due to invalid extent tree" * tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path btrfs: zoned: calculate max_extent_size properly on non-zoned setup btrfs: avoid NULL pointer dereference if no valid extent tree btrfs: don't read from userspace twice in btrfs_uring_encoded_read() io_uring: add io_uring_cmd_get_async_data helper io_uring/cmd: add per-op data to struct io_uring_cmd_data io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
2025-01-06btrfs: don't read from userspace twice in btrfs_uring_encoded_read()Mark Harmstone
If we return -EAGAIN the first time because we need to block, btrfs_uring_encoded_read() will get called twice. Take a copy of args, the iovs, and the iter the first time, as by the time we are called the second time these may have gone out of scope. Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-11btrfs: disable defrag on pre-content watched filesJosef Bacik
We queue up inodes to be defrag'ed asynchronously, which means we do not have their original file for readahead. This means that the code to skip readahead on pre-content watched files will not run, and we could potentially read in empty pages. Handle this corner case by disabling defrag on files that are currently being watched for pre-content events. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/4cc5bcea13db7904174353d08e85157356282a59.1731684329.git.josef@toxicpanda.com
2024-12-03Merge tag 'for-6.13-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - add lockdep annotations for io_uring/encoded read integration, inode lock is held when returning to userspace - properly reflect experimental config option to sysfs - handle NULL root in case the rescue mode accepts invalid/damaged tree roots (rescue=ibadroot) - regression fix of a deadlock between transaction and extent locks - fix pending bio accounting bug in encoded read ioctl - fix NOWAIT mode when checking references for NOCOW files - fix use-after-free in a rb-tree cleanup in ref-verify debugging tool * tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix lockdep warnings on io_uring encoded reads btrfs: ref-verify: fix use-after-free after invalid ref action btrfs: add a sanity check for btrfs root in btrfs_search_slot() btrfs: don't loop for nowait writes when checking for cross references btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y btrfs: fix deadlock between transaction commits and extent locks btrfs: fix use-after-free in btrfs_encoded_read_endio()