summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-03-25Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds
Pull fscrypt updates from Eric Biggers: "A fix for an issue where CONFIG_FS_ENCRYPTION could be enabled without some of its dependencies, and a small documentation update" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: mention init_on_free instead of page poisoning fscrypt: drop obsolete recommendation to enable optimized ChaCha20 Revert "fscrypt: relax Kconfig dependencies for crypto API algorithms"
2025-03-25Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds
Pull fsverity updates from Eric Biggers: "A fix for an issue where CONFIG_FS_VERITY could be enabled without some of its dependencies, and a small documentation update" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: Revert "fsverity: relax build time dependency on CRYPTO_SHA256" Documentation: add a usecase for FS_IOC_READ_VERITY_METADATA
2025-03-25Merge tag 'timers-cleanups-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-25Merge tag 'timers-core-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Fix a memory ordering issue in posix-timers Posix-timer lookup is lockless and reevaluates the timer validity under the timer lock, but the update which validates the timer is not protected by the timer lock. That allows the store to be reordered against the initialization stores, so that the lookup side can observe a partially initialized timer. That's mostly a theoretical problem, but incorrect nevertheless. - Fix a long standing inconsistency of the coarse time getters The coarse time getters read the base time of the current update cycle without reading the actual hardware clock. NTP frequency adjustment can set the base time backwards. The fine grained interfaces compensate this by reading the clock and applying the new conversion factor, but the coarse grained time getters use the base time directly. That allows the user to observe time going backwards. Cure it by always forwarding base time, when NTP changes the frequency with an immediate step. - Rework of posix-timer hashing The posix-timer hash is not scalable and due to the CRIU timer restore mechanism prone to massive contention on the global hash bucket lock. Replace the global hash lock with a fine grained per bucket locking scheme to address that. - Rework the proc/$PID/timers interface. /proc/$PID/timers is provided for CRIU to be able to restore a timer. The printout happens with sighand lock held and interrupts disabled. That's not required as this can be done with RCU protection as well. - Provide a sane mechanism for CRIU to restore a timer ID CRIU restores timers by creating and deleting them until the kernel internal per process ID counter reached the requested ID. That's horribly slow for sparse timer IDs. Provide a prctl() which allows CRIU to restore a timer with a given ID. When enabled the ID pointer is used as input pointer to read the requested ID from user space. When disabled, the normal allocation scheme (next ID) is active as before. This is backwards compatible for both kernel and user space. - Make hrtimer_update_function() less expensive. The sanity checks are valuable, but expensive for high frequency usage in io/uring. Make the debug checks conditional and enable them only when lockdep is enabled. - Small updates, cleanups and improvements * tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) selftests/timers: Improve skew_consistency by testing with other clockids timekeeping: Fix possible inconsistencies in _COARSE clockids posix-timers: Drop redundant memset() invocation selftests/timers/posix-timers: Add a test for exact allocation mode posix-timers: Provide a mechanism to allocate a given timer ID posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held posix-timers: Make per process list RCU safe posix-timers: Avoid false cacheline sharing posix-timers: Switch to jhash32() posix-timers: Improve hash table performance posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t posix-timers: Make lock_timer() use guard() posix-timers: Rework timer removal posix-timers: Simplify lock/unlock_timer() posix-timers: Use guards in a few places posix-timers: Remove SLAB_PANIC from kmem cache posix-timers: Remove a few paranoid warnings posix-timers: Cleanup includes posix-timers: Add cond_resched() to posix_timer_add() search loop posix-timers: Initialise timer before adding it to the hash table ...
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-24Merge tag 'pstore-v6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull tiny pstore update from Kees Cook: - pstore: Change kmsg_bytes storage size to u32 * tag 'pstore-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Change kmsg_bytes storage size to u32
2025-03-24Merge tag 'move-lib-kunit-v6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull lib kunit selftest move from Kees Cook: "This is a one-off tree to coordinate the move of selftests out of lib/ and into lib/tests/. A separate tree was used for this to keep the paths sane with all the work in the same place. - move lib/ selftests into lib/tests/ (Kees Cook, Gabriela Bittencourt, Luis Felipe Hernandez, Lukas Bulwahn, Tamir Duberstein) - lib/math: Add int_log test suite (Bruno Sobreira França) - lib/math: Add Kunit test suite for gcd() (Yu-Chun Lin) - lib/tests/kfifo_kunit.c: add tests for the kfifo structure (Diego Vieira) - unicode: refactor selftests into KUnit (Gabriela Bittencourt) - lib/prime_numbers: convert self-test to KUnit (Tamir Duberstein) - printf: convert self-test to KUnit (Tamir Duberstein) - scanf: convert self-test to KUnit (Tamir Duberstein)" * tag 'move-lib-kunit-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (21 commits) scanf: break kunit into test cases scanf: convert self-test to KUnit scanf: remove redundant debug logs scanf: implicate test line in failure messages printf: implicate test line in failure messages printf: break kunit into test cases printf: convert self-test to KUnit kunit/fortify: Replace "volatile" with OPTIMIZER_HIDE_VAR() kunit/fortify: Expand testing of __compiletime_strlen() kunit/stackinit: Use fill byte different from Clang i386 pattern kunit/overflow: Fix DEFINE_FLEX tests for counted_by selftests: remove reference to prime_numbers.sh MAINTAINERS: adjust entries in FORTIFY_SOURCE and KERNEL HARDENING lib/prime_numbers: convert self-test to KUnit lib/math: Add Kunit test suite for gcd() unicode: kunit: change tests filename and path unicode: kunit: refactor selftest to kunit tests lib/tests/kfifo_kunit.c: add tests for the kfifo structure lib: Move KUnit tests into tests/ subdirectory lib/math: Add int_log test suite ...
2025-03-24Merge tag 'execve-v6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - elf: Define and use note name macros (Akihiko Odaki) - elf: add remaining SHF_ flag macros (Timur Tabi) - binfmt: Remove loader from linux_binprm struct (Yonatan Goldschmidt) - binfmt_elf_fdpic: fix variable set but not used warning (sunliming) * tag 'execve-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf_fdpic: fix variable set but not used warning elf: add remaining SHF_ flag macros binfmt: Remove loader from linux_binprm struct crash: Remove KEXEC_CORE_NOTE_NAME s390/crash: Use note name macros crash: Use note name macros powerpc/crash: Use note name macros binfmt_elf: Use note name macros elf: Define note name macros
2025-03-24Merge tag 'vfs-6.15-rc1.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file handling updates from Christian Brauner: "This contains performance improvements for struct file's new refcount mechanism and various other performance work: - The stock kernel transitioning the file to no refs held penalizes the caller with an extra atomic to block any increments. For cases where the file is highly likely to be going away this is easily avoidable. Add file_ref_put_close() to better handle the common case where closing a file descriptor also operates on the last reference and build fput_close_sync() and fput_close() on top of it. This brings about 1% performance improvement by eliding one atomic in the common case. - Predict no error in close() since the vast majority of the time system call returns 0. - Reduce the work done in fdget_pos() by predicting that the file was found and by explicitly comparing the reference count to one and ignoring the dead zone" * tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce work in fdget_pos() fs: use fput_close() in path_openat() fs: use fput_close() in filp_close() fs: use fput_close_sync() in close() file: add fput and file_ref_put routines optimized for use when closing a fd fs: predict no error in close()
2025-03-24Merge tag 'vfs-6.15-rc1.orangefs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs orangefs updates from Christian Brauner: "This contains the work to remove orangefs_writepage() and partially convert it to folios. A few regular bugfixes are included as well" * tag 'vfs-6.15-rc1.orangefs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: orangefs: Convert orangefs_writepages to contain an array of folios orangefs: Simplify bvec setup in orangefs_writepages_work() orangefs: Unify error & success paths in orangefs_writepages_work() orangefs: Pass mapping to orangefs_writepages_work() orangefs: Convert orangefs_writepage_locked() to take a folio orangefs: Remove orangefs_writepage() orangefs: make open_for_read and open_for_write boolean orangefs: Move s_kmod_keyword_mask_map to orangefs-debugfs.c orangefs: Do not truncate file size
2025-03-24Merge tag 'vfs-6.15-rc1.afs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs afs updates from Christian Brauner: "This contains the work for afs for this cycle: - Fix an occasional hang that's only really encountered when rmmod'ing the kafs module - Remove the "-o autocell" mount option. This is obsolete with the dynamic root and removing it makes the next patch slightly easier - Change how the dynamic root mount is constructed. Currently, the root directory is (de)populated when it is (un)mounted if there are cells already configured and, further, pairs of automount points have to be created/removed each time a cell is added/deleted This is changed so that readdir on the root dir lists all the known cell automount pairs plus the @cell symlinks and the inodes and dentries are constructed by lookup on demand. This simplifies the cell management code - A few improvements to the afs_volume and afs_server tracepoints - Pass trace info into the afs_lookup_cell() function to allow the trace log to indicate the purpose of the lookup - Remove the 'net' parameter from afs_unuse_cell() as it's superfluous - In rxrpc, allow a kernel app (such as kafs) to store a word of information on rxrpc_peer records - Use the information stored on the rxrpc_peer record to point to the afs_server record. This allows the server address lookup to be done away with - Simplify the afs_server ref/activity accounting to make each one self-contained and not garbage collected from the cell management work item - Simplify the afs_cell ref/activity accounting to make each one of these also self-contained and not driven by a central management work item The current code was intended to make it such that a single timer for the namespace and one work item per cell could do all the work required to maintain these records. This, however, made for some sequencing problems when cleaning up these records. Further, the attempt to pass refs along with timers and work items made getting it right rather tricky when the timer or work item already had a ref attached and now a ref had to be got rid of" * tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Simplify cell record handling afs: Fix afs_server ref accounting afs: Use the per-peer app data provided by rxrpc rxrpc: Allow the app to store private data on peer structs afs: Drop the net parameter from afs_unuse_cell() afs: Make afs_lookup_cell() take a trace note afs: Improve server refcount/active count tracing afs: Improve afs_volume tracing to display a debug ID afs: Change dynroot to create contents on demand afs: Remove the "autocell" mount option
2025-03-24Merge tag 'vfs-6.15-rc1.ceph' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs ceph updates from Christian Brauner: "This contains the work to remove access to page->index from ceph and fixes the test failure observed for ceph with generic/421 by refactoring ceph_writepages_start()" * tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio ceph: Fix error handling in fill_readdir_cache() fs: Remove page_mkwrite_check_truncate() ceph: Pass a folio to ceph_allocate_page_array() ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array() ceph: Remove uses of page from ceph_process_folio_batch() ceph: Convert ceph_check_page_before_write() to use a folio ceph: Convert writepage_nounlock() to write_folio_nounlock() ceph: Convert ceph_readdir_cache_control to store a folio ceph: Convert ceph_find_incompatible() to take a folio ceph: Use a folio in ceph_page_mkwrite() ceph: Remove ceph_writepage() ceph: fix generic/421 test failure ceph: introduce ceph_submit_write() method ceph: introduce ceph_process_folio_batch() method ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
2025-03-24Merge tag 'vfs-6.15-rc1.pagesize' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-24Merge tag 'vfs-6.15-rc1.mount.namespace' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount namespace updates from Christian Brauner: "This expands the ability of anonymous mount namespaces: - Creating detached mounts from detached mounts Currently, detached mounts can only be created from attached mounts. This limitaton prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first. The current permission modelis: (1) Check that the caller is privileged over the owning user namespace of it's current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of detached mount from another detached mount. The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace. To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from. The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from. In order to check the origin mount namespace of an anonymous mount namespace the sequence number of the original mount namespace is recorded in the anonymous mount namespace. With this in place it is possible to perform an equivalent check (2') to (2). The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree. - Allow mount detached mounts on detached mounts Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. Instead, a detached tree must be created, attached, then mounted open and then either moved or detached again. Lift this restriction. In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used (cf. above). Allowing to mount detached mounts onto detached mounts leaves three cases to consider: (1) The source mount is an attached mount and the target mount is a detached mount. This would be equivalent to moving a mount between different mount namespaces. A caller could move an attached mount to a detached mount. The detached mount can now be freely attached to any mount namespace. This changes the current delegatioh model significantly for no good reason. So this will fail. (2) Anonymous mount namespaces are always attached fully, i.e., it is not possible to only attach a subtree of an anoymous mount namespace. This simplifies the implementation and reasoning. Consequently, if the anonymous mount namespace of the source detached mount and the target detached mount are the identical the mount request will fail. (3) The source mount's anonymous mount namespace is different from the target mount's anonymous mount namespace. In this case the source anonymous mount namespace of the source mount tree must be freed after its mounts have been moved to the target anonymous mount namespace. The source anonymous mount namespace must be empty afterwards. By allowing to mount detached mounts onto detached mounts a caller may do the following: fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE) fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE) fd_tree1 and fd_tree2 refer to two different detached mount trees that belong to two different anonymous mount namespace. It is important to note that fd_tree1 and fd_tree2 both refer to the root of their respective anonymous mount namespaces. By allowing to mount detached mounts onto detached mounts the caller may now do: move_mount(fd_tree1, "", fd_tree2, "", MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH) This will cause the detached mount referred to by fd_tree1 to be mounted on top of the detached mount referred to by fd_tree2. Thus, the detached mount fd_tree1 is moved from its separate anonymous mount namespace into fd_tree2's anonymous mount namespace. It also means that while fd_tree2 continues to refer to the root of its respective anonymous mount namespace fd_tree1 doesn't anymore. This has the consequence that only fd_tree2 can be moved to another anonymous or non-anonymous mount namespace. Moving fd_tree1 will now fail as fd_tree1 doesn't refer to the root of an anoymous mount namespace anymore. Now fd_tree1 and fd_tree2 refer to separate detached mount trees referring to the same anonymous mount namespace. This is conceptually fine. The new mount api does allow for this to happen already via: mount -t tmpfs tmpfs /mnt mkdir -p /mnt/A mount -t tmpfs tmpfs /mnt/A fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE) fd_tree4 = open_tree(-EBADF, "/mnt/A", 0) Both fd_tree3 and fd_tree4 refer to two different detached mount trees but both detached mount trees refer to the same anonymous mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3 may be moved another mount namespace as fd_tree3 refers to the root of the anonymous mount namespace just while fd_tree4 doesn't. However, there's an important difference between the fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example. Closing fd_tree4 and releasing the respective struct file will have no further effect on fd_tree3's detached mount tree. However, closing fd_tree3 will cause the mount tree and the respective anonymous mount namespace to be destroyed causing the detached mount tree of fd_tree4 to be invalid for further mounting. By allowing to mount detached mounts on detached mounts as in the fd_tree1/fd_tree2 example both struct files will affect each other. Both fd_tree1 and fd_tree2 refer to struct files that have FMODE_NEED_UNMOUNT set. To handle this we use the fact that @fd_tree1 will have a parent mount once it has been attached to @fd_tree2. When dissolve_on_fput() is called the mount that has been passed in will refer to the root of the anonymous mount namespace. If it doesn't it would mean that mounts are leaked. So before allowing to mount detached mounts onto detached mounts this would be a bug. Now that detached mounts can be mounted onto detached mounts it just means that the mount has been attached to another anonymous mount namespace and thus dissolve_on_fput() must not unmount the mount tree or free the anonymous mount namespace as the file referring to the root of the namespace hasn't been closed yet. If it had been closed yet it would be obvious because the mount namespace would be NULL, i.e., the @fd_tree1 would have already been unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent mount it is safe to skip any cleanup as closing @fd_tree2 will take care of all cleanup operations. - Allow mount propagation for detached mount trees In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespace to receive mount propagation events correctly. This is now also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets" * tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) selftests: test subdirectory mounting selftests: add test for detached mount tree propagation fs: namespace: fix uninitialized variable use mount: handle mount propagation for detached mount trees fs: allow creating detached mounts from fsmount() file descriptors selftests: seventh test for mounting detached mounts onto detached mounts selftests: sixth test for mounting detached mounts onto detached mounts selftests: fifth test for mounting detached mounts onto detached mounts selftests: fourth test for mounting detached mounts onto detached mounts selftests: third test for mounting detached mounts onto detached mounts selftests: second test for mounting detached mounts onto detached mounts selftests: first test for mounting detached mounts onto detached mounts fs: mount detached mounts onto detached mounts fs: support getname_maybe_null() in move_mount() selftests: create detached mounts from detached mounts fs: create detached mounts from detached mounts fs: add may_copy_tree() fs: add fastpath for dissolve_on_fput() fs: add assert for move_mount() fs: add mnt_ns_empty() helper ...
2025-03-24Merge tag 'vfs-6.15-rc1.nsfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs nsfs updates from Christian Brauner: "This contains non-urgent fixes for nsfs to validate ioctls before performing any relevant operations. We alredy did this for a few other filesystems last cycle" * tag 'vfs-6.15-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/nsfs: add ioctl validation tests nsfs: validate ioctls
2025-03-24Merge tag 'vfs-6.15-rc1.sysv' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs sysv removal from Christian Brauner: "This removes the sysv filesystem. We've discussed this various times. It's time to try" * tag 'vfs-6.15-rc1.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sysv: Remove the filesystem
2025-03-24Merge tag 'vfs-6.15-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async dir updates from Christian Brauner: "This contains cleanups that fell out of the work from async directory handling: - Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places - Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely - Drop an unnecessary call to fh_update() from nfsd_create_locked() - Change i_op->mkdir() to return a struct dentry Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand. - Repack DENTRY_* and LOOKUP_* flags" * tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: fix inline emphasis warning VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() VFS: add common error checks to lookup_one_qstr_excl() VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry VFS: repack LOOKUP_ bit flags. VFS: repack DENTRY_ flags.
2025-03-24Merge tag 'vfs-6.15-rc1.overlayfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs overlayfs updates from Christian Brauner: "Currently overlayfs uses the mounter's credentials for its override_creds() calls. That provides a consistent permission model. This patches allows a caller to instruct overlayfs to use its credentials instead. The caller must be located in the same user namespace hierarchy as the user namespace the overlayfs instance will be mounted in. This provides a consistent and simple security model. With this it is possible to e.g., mount an overlayfs instance where the mounter must have CAP_SYS_ADMIN but the credentials used for override_creds() have dropped CAP_SYS_ADMIN. It also allows the usage of custom fs{g,u}id different from the callers and other tweaks" * tag 'vfs-6.15-rc1.overlayfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/ovl: add third selftest for "override_creds" selftests/ovl: add second selftest for "override_creds" selftests/filesystems: add utils.{c,h} selftests/ovl: add first selftest for "override_creds" ovl: allow to specify override credentials
2025-03-24Merge tag 'vfs-6.15-rc1.iomap' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: - Allow the filesystem to submit the writeback bios. - Allow the filsystem to track completions on a per-bio bases instead of the entire I/O. - Change writeback_ops so that ->submit_bio can be done by the filesystem. - A new ANON_WRITE flag for writes that don't have a block number assigned to them at the iomap level leaving the filesystem to do that work in the submission handler. - Incremental iterator advance The folio_batch support for zero range where the filesystem provides a batch of folios to process that might not be logically continguous requires more flexibility than the current offset based iteration currently offers. Update all iomap operations to advance the iterator within the operation and thus remove the need to advance from the core iomap iterator. - Make buffered writes work with RWF_DONTCACHE If RWF_DONTCACHE is set for a write, mark the folios being written as uncached. On writeback completion the pages will be dropped. - Introduce infrastructure for large atomic writes This will eventually be used by xfs and ext4. * tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) iomap: rework IOMAP atomic flags iomap: comment on atomic write checks in iomap_dio_bio_iter() iomap: inline iomap_dio_bio_opflags() iomap: fix inline data on buffered read iomap: Lift blocksize restriction on atomic writes iomap: Support SW-based atomic writes iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW xfs: flag as supporting FOP_DONTCACHE iomap: make buffered writes work with RWF_DONTCACHE iomap: introduce a full map advance helper iomap: rename iomap_iter processed field to status iomap: remove unnecessary advance from iomap_iter() dax: advance the iomap_iter on pte and pmd faults dax: advance the iomap_iter on dedupe range dax: advance the iomap_iter on unshare range dax: advance the iomap_iter on zero range dax: push advance down into dax_iomap_iter() for read and write dax: advance the iomap_iter in the read/write path iomap: convert misc simple ops to incremental advance iomap: advance the iter on direct I/O ...
2025-03-24Merge tag 'vfs-6.15-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pidfs updates from Christian Brauner: - Allow retrieving exit information after a process has been reaped through pidfds via the new PIDFD_INTO_EXIT extension for the PIDFD_GET_INFO ioctl. Various tools need access to information about a process/task even after it has already been reaped. Pidfd polling allows waiting on either task exit or for a task to have been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP must be observed before exit information can be retrieved, i.e., exit information is only provided once the task has been reaped and then can be retrieved as long as the pidfd is open. - Add PIDFD_SELF_{THREAD,THREAD_GROUP} sentinels allowing userspace to forgo allocating a file descriptor for their own process. This is useful in scenarios where users want to act on their own process through pidfds and is akin to AT_FDCWD. - Improve premature thread-group leader and subthread exec behavior when polling on pidfds: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader. The correct behavior is to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit premature as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. After this pull no exit notifications will be generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec before no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle. This means an exit notification indicates the ability for the father to reap the child. * tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) selftests/pidfd: third test for multi-threaded exec polling selftests/pidfd: second test for multi-threaded exec polling selftests/pidfd: first test for multi-threaded exec polling pidfs: improve multi-threaded exec and premature thread-group leader exit polling pidfs: ensure that PIDFS_INFO_EXIT is available selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest selftests/pidfd: add third PIDFD_INFO_EXIT selftest selftests/pidfd: add second PIDFD_INFO_EXIT selftest selftests/pidfd: add first PIDFD_INFO_EXIT selftest selftests/pidfd: expand common pidfd header pidfs/selftests: ensure correct headers for ioctl handling selftests/pidfd: fix header inclusion pidfs: allow to retrieve exit information pidfs: record exit code and cgroupid at exit pidfs: use private inode slab cache pidfs: move setting flags into pidfs_alloc_file() pidfd: rely on automatic cleanup in __pidfd_prepare() ...
2025-03-24Merge tag 'vfs-6.15-rc1.pipe' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pipe updates from Christian Brauner: - Introduce struct file_operations pipeanon_fops - Don't update {a,c,m}time for anonymous pipes to avoid the performance costs associated with it - Change pipe_write() to never add a zero-sized buffer - Limit the slots in pipe_resize_ring() - Use pipe_buf() to retrieve the pipe buffer everywhere - Drop an always true check in anon_pipe_write() - Cache 2 pages instead of 1 - Avoid spurious calls to prepare_to_wait_event() in ___wait_event() * tag 'vfs-6.15-rc1.pipe' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/splice: Use pipe_buf() helper to retrieve pipe buffer fs/pipe: Use pipe_buf() helper to retrieve pipe buffer kernel/watch_queue: Use pipe_buf() to retrieve the pipe buffer fs/pipe: Limit the slots in pipe_resize_ring() wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() pipe: cache 2 pages instead of 1 pipe: drop an always true check in anon_pipe_write() pipe: change pipe_write() to never add a zero-sized buffer pipe: don't update {a,c,m}time for anonymous pipes pipe: introduce struct file_operations pipeanon_fops
2025-03-24Merge tag 'vfs-6.15-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Mount notifications The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes. Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount. The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id. This allows userspace to figure out exactly how the mount topology changed without having to generating diffs of /proc/<pid>/mountinfo in userspace. - Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api - Support detached mounts in overlayfs Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts which means userspace cannot user file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first. This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts. - Allow to retrieve idmappings with statmount Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows to read the idmapping from the mount. - Allow creating idmapped mounts from mounts that are already idmapped So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allow to pass struct mount_attr together with the open_tree_attr() system call allowing to solve these issues without complicating VFS lookup in any way. The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace. - Add a way to query statmount() for supported options Allow userspace to query which mount information can be retrieved through statmount(). - Allow superblock owners to force unmount * tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) umount: Allow superblock owners to force umount selftests: add tests for mount notification selinux: add FILE__WATCH_MOUNTNS samples/vfs: fix printf format string for size_t fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper statmount: add a new supported_mask field samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP selftests: add tests for using detached mount with overlayfs samples/vfs: check whether flag was raised statmount: allow to retrieve idmappings uidgid: add map_id_range_up() fs: allow detached mounts in clone_private_mount() selftests/overlayfs: test specifying layers as O_PATH file descriptors fs: support O_PATH fds with FSCONFIG_SET_FD vfs: add notifications for mount attach and detach fanotify: notify on mount attach and detach ...
2025-03-24Merge tag 'vfs-6.15-rc1.eventpoll' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs eventpoll updates from Christian Brauner: "This contains a few preparatory changes to eventpoll to allow io_uring to support epoll" * tag 'vfs-6.15-rc1.eventpoll' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: eventpoll: add epoll_sendevents() helper eventpoll: abstract out ep_try_send_events() helper eventpoll: abstract out parameter sanity checking
2025-03-24Merge tag 'vfs-6.15-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Add CONFIG_DEBUG_VFS infrastucture: - Catch invalid modes in open - Use the new debug macros in inode_set_cached_link() - Use debug-only asserts around fd allocation and install - Place f_ref to 3rd cache line in struct file to resolve false sharing Cleanups: - Start using anon_inode_getfile_fmode() helper in various places - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock - Add unlikely() to kcmp() - Remove legacy ->remount_fs method from ecryptfs after port to the new mount api - Remove invalidate_inodes() in favour of evict_inodes() - Simplify ep_busy_loopER by removing unused argument - Avoid mmap sem relocks when coredumping with many missing pages - Inline getname() - Inline new_inode_pseudo() and de-staticize alloc_inode() - Dodge an atomic in putname if ref == 1 - Consistently deref the files table with rcu_dereference_raw() - Dedup handling of struct filename init and refcounts bumps - Use wq_has_sleeper() in end_dir_add() - Drop the lock trip around I_NEW wake up in evict() - Load the ->i_sb pointer once in inode_sb_list_{add,del} - Predict not reaching the limit in alloc_empty_file() - Tidy up do_sys_openat2() with likely/unlikely - Call inode_sb_list_add() outside of inode hash lock - Sort out fd allocation vs dup2 race commentary - Turn page_offset() into a wrapper around folio_pos() - Remove locking in exportfs around ->get_parent() call - try_lookup_one_len() does not need any locks in autofs - Fix return type of several functions from long to int in open - Fix return type of several functions from long to int in ioctls Fixes: - Fix watch queue accounting mismatch" * tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: sort out fd allocation vs dup2 race commentary, take 2 fs: call inode_sb_list_add() outside of inode hash lock fs: tidy up do_sys_openat2() with likely/unlikely fs: predict not reaching the limit in alloc_empty_file() fs: load the ->i_sb pointer once in inode_sb_list_{add,del} fs: drop the lock trip around I_NEW wake up in evict() fs: use wq_has_sleeper() in end_dir_add() VFS/autofs: try_lookup_one_len() does not need any locks fs: dedup handling of struct filename init and refcounts bumps fs: consistently deref the files table with rcu_dereference_raw() exportfs: remove locking around ->get_parent() call. fs: use debug-only asserts around fd allocation and install fs: dodge an atomic in putname if ref == 1 vfs: Remove invalidate_inodes() ecryptfs: remove NULL remount_fs from super_operations watch_queue: fix pipe accounting mismatch fs: place f_ref to 3rd cache line in struct file to resolve false sharing epoll: simplify ep_busy_loop by removing always 0 argument fs: Turn page_offset() into a wrapper around folio_pos() kcmp: improve performance adding an unlikely hint to task comparisons ...
2025-03-24Merge tag 'vfs-6.15-rc1.mount.api' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API updates from Christian Brauner: "This converts the remaining pseudo filesystems to the new mount api. The sysv conversion is a bit gratuitous because we remove sysv in another pull request. But if we have to revert the removal we at least will have it converted to the new mount api already" * tag 'vfs-6.15-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sysv: convert sysv to use the new mount api vfs: remove some unused old mount api code devtmpfs: replace ->mount with ->get_tree in public instance vfs: Convert devpts to use the new mount API pstore: convert to the new mount API
2025-03-20Merge tag 'v6.14-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fix from Steve French: "smb3 client reconnect fix" * tag 'v6.14-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: smb: client: don't retry IO on failed negprotos with soft mounts
2025-03-20Merge tag 'vfs-6.14-final.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "A final set of fixes for this cycle: VFS: - Ensure that the stable offset api doesn't return duplicate directory entries when userspace has to perform the getdents call multiple times on large directories afs: - Prevent invalid pointer dereference during get_link RCU pathwalk fuse: - Fix deadlock caused by uninitialized rings when using io_uring with fuse - Handle race condition when using io_uring with fuse to prevent NULL dereference libnetfs: - Ensure that invalidate_cache is only called if implemented - Fix collection of results during pause when collection is offloaded - Ensure rolling_buffer_load_from_ra() doesn't clear mark bits - Make netfs_unbuffered_read() return ssize_t rather than int" * tag 'vfs-6.14-final.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: libfs: Fix duplicate directory entry in offset_dir_lookup fuse: fix possible deadlock if rings are never initialized netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int netfs: Fix rolling_buffer_load_from_ra() to not clear mark bits netfs: Call `invalidate_cache` only if implemented netfs: Fix collection of results during pause when collection offloaded fuse: fix uring race condition for null dereference of fc afs: Fix afs_atcell_get_link() to check if ws_cell is unset first
2025-03-20Merge tag 'efi-fixes-for-v6.14-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "Here's a final batch of EFI fixes for v6.14. The efivarfs ones are fixes for changes that were made this cycle. James's fix is somewhat of a band-aid, but it was blessed by the VFS folks, who are working with James to come up with something better for the next cycle. - Avoid physical address 0x0 for random page allocations - Add correct lockdep annotation when traversing efivarfs on resume - Avoid NULL mount in kernel_file_open() when traversing efivarfs on resume" * tag 'efi-fixes-for-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: fix NULL dereference on resume efivarfs: use I_MUTEX_CHILD nested lock to traverse variables on resume efi/libstub: Avoid physical address 0x0 when doing random allocation
2025-03-20pidfs: improve multi-threaded exec and premature thread-group leader exit ↵Christian Brauner
polling This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit consistent. A quick recap of these two cases: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader. The correct behavior would be to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. So far there was no way to distinguish between (1) and (2) internally. This tiny series tries to address this problem by discarding PIDFD_THREAD notification on premature thread-group leader exit. If that works correctly then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec aftewards no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle. Co-Developed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20fs: sort out fd allocation vs dup2 race commentary, take 2Mateusz Guzik
fd_install() has a questionable comment above it. While it correctly points out a possible race against dup2(), it states: > We need to detect this and fput() the struct file we are about to > overwrite in this case. > > It should never happen - if we allow dup2() do it, _really_ bad things > will follow. I have difficulty parsing the above. The first sentence would suggest fd_install() tries to detect and recover from the race (it does not), the next one claims the race needs to be dealt with (it is, by dup2()). Given that fd_install() does not suffer the burden, this patch removes the above and instead expands on the race in dup2() commentary. While here tidy up the docs around fd_install(). Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250320102637.1924183-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20iomap: rework IOMAP atomic flagsJohn Garry
Flag IOMAP_ATOMIC_SW is not really required. The idea of having this flag is that the FS ->iomap_begin callback could check if this flag is set to decide whether to do a SW (FS-based) atomic write. But the FS can set which ->iomap_begin callback it wants when deciding to do a FS-based atomic write. Furthermore, it was thought that IOMAP_ATOMIC_HW is not a proper name, as the block driver can use SW-methods to emulate an atomic write. So change back to IOMAP_ATOMIC. The ->iomap_begin callback needs though to indicate to iomap core that REQ_ATOMIC needs to be set, so add IOMAP_F_ATOMIC_BIO for that. These changes were suggested by Christoph Hellwig and Dave Chinner. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250320120250.4087011-4-john.g.garry@oracle.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20iomap: comment on atomic write checks in iomap_dio_bio_iter()John Garry
Help explain the code. Also clarify the comment for bio size check. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250320120250.4087011-3-john.g.garry@oracle.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20iomap: inline iomap_dio_bio_opflags()John Garry
It is neater to build blk_opf_t fully in one place, so inline iomap_dio_bio_opflags() in iomap_dio_bio_iter(). Also tidy up the logic in dealing with IOMAP_DIO_CALLER_COMP, in generally separate the logic in dealing with flags associated with reads and writes. Originally-from: Christoph Hellwig <hch@lst.de> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250320120250.4087011-2-john.g.garry@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20libfs: Fix duplicate directory entry in offset_dir_lookupYongjian Sun
There is an issue in the kernel: In tmpfs, when using the "ls" command to list the contents of a directory with a large number of files, glibc performs the getdents call in multiple rounds. If a concurrent unlink occurs between these getdents calls, it may lead to duplicate directory entries in the ls output. One possible reproduction scenario is as follows: Create 1026 files and execute ls and rm concurrently: for i in {1..1026}; do echo "This is file $i" > /tmp/dir/file$i done ls /tmp/dir rm /tmp/dir/file4 ->getdents(file1026-file5) ->unlink(file4) ->getdents(file5,file3,file2,file1) It is expected that the second getdents call to return file3 through file1, but instead it returns an extra file5. The root cause of this problem is in the offset_dir_lookup function. It uses mas_find to determine the starting position for the current getdents call. Since mas_find locates the first position that is greater than or equal to mas->index, when file4 is deleted, it ends up returning file5. It can be fixed by replacing mas_find with mas_find_rev, which finds the first position that is less than or equal to mas->index. Fixes: b9b588f22a0c ("libfs: Use d_children list to iterate simple_offset directories") Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Link: https://lore.kernel.org/r/20250320034417.555810-1-sunyongjian@huaweicloud.com Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20fs: call inode_sb_list_add() outside of inode hash lockMateusz Guzik
As both locks are highly contended during significant inode churn, holding the inode hash lock while waiting for the sb list lock exacerbates the problem. Why moving it out is safe: the inode at hand still has I_NEW set and anyone who finds it through legitimate means waits for the bit to clear, by which time inode_sb_list_add() is guaranteed to have finished. This significantly drops hash lock contention for me when stating 20 separate trees in parallel, each with 1000 directories * 1000 files. However, no speed up was observed as contention increased on the other locks, notably dentry LRU. Even so, removal of the lock ordering will help making this faster later. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250320004643.1903287-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20fs: tidy up do_sys_openat2() with likely/unlikelyMateusz Guzik
Otherwise gcc 13 generates conditional forward jumps (aka branch mispredict by default) for build_open_flags() being succesfull. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250320092331.1921700-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20fs: reduce work in fdget_pos()Mateusz Guzik
1. predict the file was found 2. explicitly compare the ref to "one", ignoring the dead zone The latter arguably improves the behavior to begin with. Suppose the count turned bad -- the previously used ref routine is going to check for it and return 0, indicating the count does not necessitate taking ->f_pos_lock. But there very well may be several users. i.e. not paying for special-casing the dead zone improves semantics. While here spell out each condition in a dedicated if statement. This has no effect on generated code. Sizes are as follows (in bytes; gcc 13, x86-64): stock: 321 likely(): 298 likely()+ref: 280 Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250319215801.1870660-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditionalIngo Molnar
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19pidfs: ensure that PIDFS_INFO_EXIT is availableChristian Brauner
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the time we return the pidfd to userspace the task has already been reaped since we don't check again after having created a dentry for it. This was fine until now because that race was meaningless. But now that we provide PIDFD_INFO_EXIT it is a problem because it is possible that the kernel returns a reaped pidfd and it depends on the race whether PIDFD_INFO_EXIT information is available. This depends on if the task gets reaped before or after a dentry has been attached to struct pid. Make this consistent and only returned pidfds for reaped tasks if PIDFD_INFO_EXIT information is available. This is done by performing another check whether the task has been reaped right after we attached a dentry to struct pid. Since pidfs_exit() is called before struct pid's task linkage is removed the case where the task got reaped but a dentry was already attached to struct pid and exit information was recorded and published can be handled correctly. In that case we do return a pidfd for a reaped task like we would've before. Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19fs: predict not reaching the limit in alloc_empty_file()Mateusz Guzik
Eliminates a jump over a call to capable() in the common case. By default the limit is not even set, in which case the check can't even fail to begin with. It remains unset at least on Debian and Ubuntu. For this cases this can probably become a static branch instead. In the meantime tidy it up. I note the check separate from the bump makes the entire thing racy. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250319124923.1838719-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19iomap: fix inline data on buffered readGao Xiang
Previously, iomap_readpage_iter() returning 0 would break out of the loops of iomap_readahead_iter(), which is what iomap_read_inline_data() relies on. However, commit d9dc477ff6a2 ("iomap: advance the iter directly on buffered read") changes this behavior without calling iomap_iter_advance(), which causes EROFS to get stuck in iomap_readpage_iter(). It seems iomap_iter_advance() cannot be called in iomap_read_inline_data() because of the iomap_write_begin() path, so handle this in iomap_readpage_iter() instead. Reported-and-tested-by: Bo Liu <liubo03@inspur.com> Fixes: d9dc477ff6a2 ("iomap: advance the iter directly on buffered read") Cc: Brian Foster <bfoster@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250319085125.4039368-1-hsiangkao@linux.alibaba.com Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19fuse: fix possible deadlock if rings are never initializedLuis Henriques
When mounting a user-space filesystem using io_uring, the initialization of the rings is done separately in the server side. If for some reason (e.g. a server bug) this step is not performed it will be impossible to unmount the filesystem if there are already requests waiting. This issue is easily reproduced with the libfuse passthrough_ll example, if the queue depth is set to '0' and a request is queued before trying to unmount the filesystem. When trying to force the unmount, fuse_abort_conn() will try to wake up all tasks waiting in fc->blocked_waitq, but because the rings were never initialized, fuse_uring_ready() will never return 'true'. Fixes: 3393ff964e0f ("fuse: block request allocation until io-uring init is complete") Signed-off-by: Luis Henriques <luis@igalia.com> Link: https://lore.kernel.org/r/20250306111218.13734-1-luis@igalia.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19netfs: Fix netfs_unbuffered_read() to return ssize_t rather than intDavid Howells
Fix netfs_unbuffered_read() to return an ssize_t rather than an int as netfs_wait_for_read() returns ssize_t and this gets implicitly truncated. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20250314164201.1993231-5-dhowells@redhat.com Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19netfs: Fix rolling_buffer_load_from_ra() to not clear mark bitsDavid Howells
rolling_buffer_load_from_ra() looms large in the perf report because it loops around doing an atomic clear for each of the three mark bits per folio. However, this is both inefficient (it would be better to build a mask and atomically AND them out) and unnecessary as they shouldn't be set. Fix this by removing the loop. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20250314164201.1993231-4-dhowells@redhat.com Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: netfs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19netfs: Call `invalidate_cache` only if implementedMax Kellermann
Many filesystems such as NFS and Ceph do not implement the `invalidate_cache` method. On those filesystems, if writing to the cache (`NETFS_WRITE_TO_CACHE`) fails for some reason, the kernel crashes like this: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 0 P4D 0 Oops: Oops: 0010 [#1] SMP PTI CPU: 9 UID: 0 PID: 3380 Comm: kworker/u193:11 Not tainted 6.13.3-cm4all1-hp #437 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018 Workqueue: events_unbound netfs_write_collection_worker RIP: 0010:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 0018:ffff9b86e2ca7dc0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 7fffffffffffffff RDX: 0000000000000001 RSI: ffff89259d576a18 RDI: ffff89259d576900 RBP: ffff89259d5769b0 R08: ffff9b86e2ca7d28 R09: 0000000000000002 R10: ffff89258ceaca80 R11: 0000000000000001 R12: 0000000000000020 R13: ffff893d158b9338 R14: ffff89259d576900 R15: ffff89259d5769b0 FS: 0000000000000000(0000) GS:ffff893c9fa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 000000054442e003 CR4: 00000000001706f0 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x15c/0x460 ? try_to_wake_up+0x2d2/0x530 ? exc_page_fault+0x5e/0x100 ? asm_exc_page_fault+0x22/0x30 netfs_write_collection_worker+0xe9f/0x12b0 ? xs_poll_check_readable+0x3f/0x80 ? xs_stream_data_receive_workfn+0x8d/0x110 process_one_work+0x134/0x2d0 worker_thread+0x299/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xba/0xe0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: CR2: 0000000000000000 This patch adds the missing `NULL` check. Fixes: 0e0f2dfe880f ("netfs: Dispatch write requests to process a writeback slice") Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20250314164201.1993231-3-dhowells@redhat.com Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> cc: netfs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19netfs: Fix collection of results during pause when collection offloadedDavid Howells
A netfs read request can run in one of two modes: for synchronous reads writes, the app thread does the collection of results and for asynchronous reads, this is offloaded to a worker thread. This is controlled by the NETFS_RREQ_OFFLOAD_COLLECTION flag. Now, if a subrequest incurs an error, the NETFS_RREQ_PAUSE flag is set to stop the issuing loop temporarily from issuing more subrequests until a retry is successful or the request is abandoned. When the issuing loop sees NETFS_RREQ_PAUSE, it jumps to netfs_wait_for_pause() which will wait for the PAUSE flag to be cleared - and whilst it is waiting, it will call out to the collector as more results acrue... But this is the wrong thing to do if OFFLOAD_COLLECTION is set as we can then end up with both the app thread and the work item collecting results simultaneously. This manifests itself occasionally when running the generic/323 xfstest against multichannel cifs as an oops that's a bit random but frequently involving io_submit() (the test does lots of simultaneous async DIO reads). Fix this by only doing the collection in netfs_wait_for_pause() if the NETFS_RREQ_OFFLOAD_COLLECTION is not set. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Steve French <stfrench@microsoft.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20250314164201.1993231-2-dhowells@redhat.com Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19fs: load the ->i_sb pointer once in inode_sb_list_{add,del}Mateusz Guzik
While this may sound like a pedantic clean up, it does in fact impact code generation -- the patched add routine is slightly smaller. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250319004635.1820589-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19fuse: fix uring race condition for null dereference of fcJoanne Koong
There is a race condition leading to a kernel crash from a null dereference when attemping to access fc->lock in fuse_uring_create_queue(). fc may be NULL in the case where another thread is creating the uring in fuse_uring_create() and has set fc->ring but has not yet set ring->fc when fuse_uring_create_queue() reads ring->fc. There is another race condition as well where in fuse_uring_register(), ring->nr_queues may still be 0 and not yet set to the new value when we compare qid against it. This fix sets fc->ring only after ring->fc and ring->nr_queues have been set, which guarantees now that ring->fc is a proper pointer when any queues are created and ring->nr_queues reflects the right number of queues if ring is not NULL. We must use smp_store_release() and smp_load_acquire() semantics to ensure the ordering will remain correct where fc->ring is assigned only after ring->fc and ring->nr_queues have been assigned. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20250318003028.3330599-1-joannelkoong@gmail.com Fixes: 24fe962c86f5 ("fuse: {io-uring} Handle SQEs - register commands") Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19afs: Fix afs_atcell_get_link() to check if ws_cell is unset firstDavid Howells
Fix afs_atcell_get_link() to check if the workstation cell is unset before doing the RCU pathwalk bit where we dereference that. Fixes: 823869e1e616 ("afs: Fix afs_atcell_get_link() to handle RCU pathwalk") Reported-by: syzbot+76a6f18e3af82e84f264@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/2481796.1742296819@warthog.procyon.org.uk Tested-by: syzbot+76a6f18e3af82e84f264@syzkaller.appspotmail.com cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19umount: Allow superblock owners to force umountTrond Myklebust
Loosen the permission check on forced umount to allow users holding CAP_SYS_ADMIN privileges in namespaces that are privileged with respect to the userns that originally mounted the filesystem. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/12f212d4ef983714d065a6bb372fbb378753bf4c.1742315194.git.trond.myklebust@hammerspace.com Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Christian Brauner <brauner@kernel.org>