summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-24Merge tag 'vfs-6.15-rc1.orangefs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs orangefs updates from Christian Brauner: "This contains the work to remove orangefs_writepage() and partially convert it to folios. A few regular bugfixes are included as well" * tag 'vfs-6.15-rc1.orangefs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: orangefs: Convert orangefs_writepages to contain an array of folios orangefs: Simplify bvec setup in orangefs_writepages_work() orangefs: Unify error & success paths in orangefs_writepages_work() orangefs: Pass mapping to orangefs_writepages_work() orangefs: Convert orangefs_writepage_locked() to take a folio orangefs: Remove orangefs_writepage() orangefs: make open_for_read and open_for_write boolean orangefs: Move s_kmod_keyword_mask_map to orangefs-debugfs.c orangefs: Do not truncate file size
2025-03-24Merge tag 'vfs-6.15-rc1.afs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs afs updates from Christian Brauner: "This contains the work for afs for this cycle: - Fix an occasional hang that's only really encountered when rmmod'ing the kafs module - Remove the "-o autocell" mount option. This is obsolete with the dynamic root and removing it makes the next patch slightly easier - Change how the dynamic root mount is constructed. Currently, the root directory is (de)populated when it is (un)mounted if there are cells already configured and, further, pairs of automount points have to be created/removed each time a cell is added/deleted This is changed so that readdir on the root dir lists all the known cell automount pairs plus the @cell symlinks and the inodes and dentries are constructed by lookup on demand. This simplifies the cell management code - A few improvements to the afs_volume and afs_server tracepoints - Pass trace info into the afs_lookup_cell() function to allow the trace log to indicate the purpose of the lookup - Remove the 'net' parameter from afs_unuse_cell() as it's superfluous - In rxrpc, allow a kernel app (such as kafs) to store a word of information on rxrpc_peer records - Use the information stored on the rxrpc_peer record to point to the afs_server record. This allows the server address lookup to be done away with - Simplify the afs_server ref/activity accounting to make each one self-contained and not garbage collected from the cell management work item - Simplify the afs_cell ref/activity accounting to make each one of these also self-contained and not driven by a central management work item The current code was intended to make it such that a single timer for the namespace and one work item per cell could do all the work required to maintain these records. This, however, made for some sequencing problems when cleaning up these records. Further, the attempt to pass refs along with timers and work items made getting it right rather tricky when the timer or work item already had a ref attached and now a ref had to be got rid of" * tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Simplify cell record handling afs: Fix afs_server ref accounting afs: Use the per-peer app data provided by rxrpc rxrpc: Allow the app to store private data on peer structs afs: Drop the net parameter from afs_unuse_cell() afs: Make afs_lookup_cell() take a trace note afs: Improve server refcount/active count tracing afs: Improve afs_volume tracing to display a debug ID afs: Change dynroot to create contents on demand afs: Remove the "autocell" mount option
2025-03-24Merge tag 'vfs-6.15-rc1.initramfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs initramfs updates from Christian Brauner: "This adds basic kunit test coverage for initramfs unpacking and cleans up some buffer handling issues and inefficiencies" * tag 'vfs-6.15-rc1.initramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: MAINTAINERS: append initramfs files to the VFS section initramfs: avoid static buffer for error message initramfs: fix hardlink hash leak without TRAILER initramfs: reuse name_len for dir mtime tracking initramfs: allocate heap buffers together initramfs: avoid memcpy for hex header fields vsprintf: add simple_strntoul initramfs_test: kunit tests for initramfs unpacking init: add initramfs_internal.h
2025-03-24Merge tag 'vfs-6.15-rc1.ceph' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs ceph updates from Christian Brauner: "This contains the work to remove access to page->index from ceph and fixes the test failure observed for ceph with generic/421 by refactoring ceph_writepages_start()" * tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio ceph: Fix error handling in fill_readdir_cache() fs: Remove page_mkwrite_check_truncate() ceph: Pass a folio to ceph_allocate_page_array() ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array() ceph: Remove uses of page from ceph_process_folio_batch() ceph: Convert ceph_check_page_before_write() to use a folio ceph: Convert writepage_nounlock() to write_folio_nounlock() ceph: Convert ceph_readdir_cache_control to store a folio ceph: Convert ceph_find_incompatible() to take a folio ceph: Use a folio in ceph_page_mkwrite() ceph: Remove ceph_writepage() ceph: fix generic/421 test failure ceph: introduce ceph_submit_write() method ceph: introduce ceph_process_folio_batch() method ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
2025-03-24Merge tag 'vfs-6.15-rc1.pagesize' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-24Merge tag 'vfs-6.15-rc1.mount.namespace' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount namespace updates from Christian Brauner: "This expands the ability of anonymous mount namespaces: - Creating detached mounts from detached mounts Currently, detached mounts can only be created from attached mounts. This limitaton prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first. The current permission modelis: (1) Check that the caller is privileged over the owning user namespace of it's current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of detached mount from another detached mount. The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace. To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from. The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from. In order to check the origin mount namespace of an anonymous mount namespace the sequence number of the original mount namespace is recorded in the anonymous mount namespace. With this in place it is possible to perform an equivalent check (2') to (2). The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree. - Allow mount detached mounts on detached mounts Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. Instead, a detached tree must be created, attached, then mounted open and then either moved or detached again. Lift this restriction. In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used (cf. above). Allowing to mount detached mounts onto detached mounts leaves three cases to consider: (1) The source mount is an attached mount and the target mount is a detached mount. This would be equivalent to moving a mount between different mount namespaces. A caller could move an attached mount to a detached mount. The detached mount can now be freely attached to any mount namespace. This changes the current delegatioh model significantly for no good reason. So this will fail. (2) Anonymous mount namespaces are always attached fully, i.e., it is not possible to only attach a subtree of an anoymous mount namespace. This simplifies the implementation and reasoning. Consequently, if the anonymous mount namespace of the source detached mount and the target detached mount are the identical the mount request will fail. (3) The source mount's anonymous mount namespace is different from the target mount's anonymous mount namespace. In this case the source anonymous mount namespace of the source mount tree must be freed after its mounts have been moved to the target anonymous mount namespace. The source anonymous mount namespace must be empty afterwards. By allowing to mount detached mounts onto detached mounts a caller may do the following: fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE) fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE) fd_tree1 and fd_tree2 refer to two different detached mount trees that belong to two different anonymous mount namespace. It is important to note that fd_tree1 and fd_tree2 both refer to the root of their respective anonymous mount namespaces. By allowing to mount detached mounts onto detached mounts the caller may now do: move_mount(fd_tree1, "", fd_tree2, "", MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH) This will cause the detached mount referred to by fd_tree1 to be mounted on top of the detached mount referred to by fd_tree2. Thus, the detached mount fd_tree1 is moved from its separate anonymous mount namespace into fd_tree2's anonymous mount namespace. It also means that while fd_tree2 continues to refer to the root of its respective anonymous mount namespace fd_tree1 doesn't anymore. This has the consequence that only fd_tree2 can be moved to another anonymous or non-anonymous mount namespace. Moving fd_tree1 will now fail as fd_tree1 doesn't refer to the root of an anoymous mount namespace anymore. Now fd_tree1 and fd_tree2 refer to separate detached mount trees referring to the same anonymous mount namespace. This is conceptually fine. The new mount api does allow for this to happen already via: mount -t tmpfs tmpfs /mnt mkdir -p /mnt/A mount -t tmpfs tmpfs /mnt/A fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE) fd_tree4 = open_tree(-EBADF, "/mnt/A", 0) Both fd_tree3 and fd_tree4 refer to two different detached mount trees but both detached mount trees refer to the same anonymous mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3 may be moved another mount namespace as fd_tree3 refers to the root of the anonymous mount namespace just while fd_tree4 doesn't. However, there's an important difference between the fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example. Closing fd_tree4 and releasing the respective struct file will have no further effect on fd_tree3's detached mount tree. However, closing fd_tree3 will cause the mount tree and the respective anonymous mount namespace to be destroyed causing the detached mount tree of fd_tree4 to be invalid for further mounting. By allowing to mount detached mounts on detached mounts as in the fd_tree1/fd_tree2 example both struct files will affect each other. Both fd_tree1 and fd_tree2 refer to struct files that have FMODE_NEED_UNMOUNT set. To handle this we use the fact that @fd_tree1 will have a parent mount once it has been attached to @fd_tree2. When dissolve_on_fput() is called the mount that has been passed in will refer to the root of the anonymous mount namespace. If it doesn't it would mean that mounts are leaked. So before allowing to mount detached mounts onto detached mounts this would be a bug. Now that detached mounts can be mounted onto detached mounts it just means that the mount has been attached to another anonymous mount namespace and thus dissolve_on_fput() must not unmount the mount tree or free the anonymous mount namespace as the file referring to the root of the namespace hasn't been closed yet. If it had been closed yet it would be obvious because the mount namespace would be NULL, i.e., the @fd_tree1 would have already been unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent mount it is safe to skip any cleanup as closing @fd_tree2 will take care of all cleanup operations. - Allow mount propagation for detached mount trees In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespace to receive mount propagation events correctly. This is now also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets" * tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) selftests: test subdirectory mounting selftests: add test for detached mount tree propagation fs: namespace: fix uninitialized variable use mount: handle mount propagation for detached mount trees fs: allow creating detached mounts from fsmount() file descriptors selftests: seventh test for mounting detached mounts onto detached mounts selftests: sixth test for mounting detached mounts onto detached mounts selftests: fifth test for mounting detached mounts onto detached mounts selftests: fourth test for mounting detached mounts onto detached mounts selftests: third test for mounting detached mounts onto detached mounts selftests: second test for mounting detached mounts onto detached mounts selftests: first test for mounting detached mounts onto detached mounts fs: mount detached mounts onto detached mounts fs: support getname_maybe_null() in move_mount() selftests: create detached mounts from detached mounts fs: create detached mounts from detached mounts fs: add may_copy_tree() fs: add fastpath for dissolve_on_fput() fs: add assert for move_mount() fs: add mnt_ns_empty() helper ...
2025-03-24Merge tag 'vfs-6.15-rc1.nsfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs nsfs updates from Christian Brauner: "This contains non-urgent fixes for nsfs to validate ioctls before performing any relevant operations. We alredy did this for a few other filesystems last cycle" * tag 'vfs-6.15-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/nsfs: add ioctl validation tests nsfs: validate ioctls
2025-03-24Merge tag 'vfs-6.15-rc1.sysv' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs sysv removal from Christian Brauner: "This removes the sysv filesystem. We've discussed this various times. It's time to try" * tag 'vfs-6.15-rc1.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sysv: Remove the filesystem
2025-03-24Merge tag 'vfs-6.15-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async dir updates from Christian Brauner: "This contains cleanups that fell out of the work from async directory handling: - Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places - Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely - Drop an unnecessary call to fh_update() from nfsd_create_locked() - Change i_op->mkdir() to return a struct dentry Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand. - Repack DENTRY_* and LOOKUP_* flags" * tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: fix inline emphasis warning VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() VFS: add common error checks to lookup_one_qstr_excl() VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry VFS: repack LOOKUP_ bit flags. VFS: repack DENTRY_ flags.
2025-03-24Merge tag 'vfs-6.15-rc1.overlayfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs overlayfs updates from Christian Brauner: "Currently overlayfs uses the mounter's credentials for its override_creds() calls. That provides a consistent permission model. This patches allows a caller to instruct overlayfs to use its credentials instead. The caller must be located in the same user namespace hierarchy as the user namespace the overlayfs instance will be mounted in. This provides a consistent and simple security model. With this it is possible to e.g., mount an overlayfs instance where the mounter must have CAP_SYS_ADMIN but the credentials used for override_creds() have dropped CAP_SYS_ADMIN. It also allows the usage of custom fs{g,u}id different from the callers and other tweaks" * tag 'vfs-6.15-rc1.overlayfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/ovl: add third selftest for "override_creds" selftests/ovl: add second selftest for "override_creds" selftests/filesystems: add utils.{c,h} selftests/ovl: add first selftest for "override_creds" ovl: allow to specify override credentials
2025-03-24Merge tag 'vfs-6.15-rc1.iomap' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: - Allow the filesystem to submit the writeback bios. - Allow the filsystem to track completions on a per-bio bases instead of the entire I/O. - Change writeback_ops so that ->submit_bio can be done by the filesystem. - A new ANON_WRITE flag for writes that don't have a block number assigned to them at the iomap level leaving the filesystem to do that work in the submission handler. - Incremental iterator advance The folio_batch support for zero range where the filesystem provides a batch of folios to process that might not be logically continguous requires more flexibility than the current offset based iteration currently offers. Update all iomap operations to advance the iterator within the operation and thus remove the need to advance from the core iomap iterator. - Make buffered writes work with RWF_DONTCACHE If RWF_DONTCACHE is set for a write, mark the folios being written as uncached. On writeback completion the pages will be dropped. - Introduce infrastructure for large atomic writes This will eventually be used by xfs and ext4. * tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) iomap: rework IOMAP atomic flags iomap: comment on atomic write checks in iomap_dio_bio_iter() iomap: inline iomap_dio_bio_opflags() iomap: fix inline data on buffered read iomap: Lift blocksize restriction on atomic writes iomap: Support SW-based atomic writes iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW xfs: flag as supporting FOP_DONTCACHE iomap: make buffered writes work with RWF_DONTCACHE iomap: introduce a full map advance helper iomap: rename iomap_iter processed field to status iomap: remove unnecessary advance from iomap_iter() dax: advance the iomap_iter on pte and pmd faults dax: advance the iomap_iter on dedupe range dax: advance the iomap_iter on unshare range dax: advance the iomap_iter on zero range dax: push advance down into dax_iomap_iter() for read and write dax: advance the iomap_iter in the read/write path iomap: convert misc simple ops to incremental advance iomap: advance the iter on direct I/O ...
2025-03-24Merge tag 'vfs-6.15-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pidfs updates from Christian Brauner: - Allow retrieving exit information after a process has been reaped through pidfds via the new PIDFD_INTO_EXIT extension for the PIDFD_GET_INFO ioctl. Various tools need access to information about a process/task even after it has already been reaped. Pidfd polling allows waiting on either task exit or for a task to have been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP must be observed before exit information can be retrieved, i.e., exit information is only provided once the task has been reaped and then can be retrieved as long as the pidfd is open. - Add PIDFD_SELF_{THREAD,THREAD_GROUP} sentinels allowing userspace to forgo allocating a file descriptor for their own process. This is useful in scenarios where users want to act on their own process through pidfds and is akin to AT_FDCWD. - Improve premature thread-group leader and subthread exec behavior when polling on pidfds: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader. The correct behavior is to simply not generate an exit notification on the struct pid of a subhthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit premature as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. After this pull no exit notifications will be generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec before no exit notification will be generated until that task exits or it creates subthreads and repeates the cycle. This means an exit notification indicates the ability for the father to reap the child. * tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) selftests/pidfd: third test for multi-threaded exec polling selftests/pidfd: second test for multi-threaded exec polling selftests/pidfd: first test for multi-threaded exec polling pidfs: improve multi-threaded exec and premature thread-group leader exit polling pidfs: ensure that PIDFS_INFO_EXIT is available selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest selftests/pidfd: add third PIDFD_INFO_EXIT selftest selftests/pidfd: add second PIDFD_INFO_EXIT selftest selftests/pidfd: add first PIDFD_INFO_EXIT selftest selftests/pidfd: expand common pidfd header pidfs/selftests: ensure correct headers for ioctl handling selftests/pidfd: fix header inclusion pidfs: allow to retrieve exit information pidfs: record exit code and cgroupid at exit pidfs: use private inode slab cache pidfs: move setting flags into pidfs_alloc_file() pidfd: rely on automatic cleanup in __pidfd_prepare() ...
2025-03-24Merge tag 'vfs-6.15-rc1.pipe' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pipe updates from Christian Brauner: - Introduce struct file_operations pipeanon_fops - Don't update {a,c,m}time for anonymous pipes to avoid the performance costs associated with it - Change pipe_write() to never add a zero-sized buffer - Limit the slots in pipe_resize_ring() - Use pipe_buf() to retrieve the pipe buffer everywhere - Drop an always true check in anon_pipe_write() - Cache 2 pages instead of 1 - Avoid spurious calls to prepare_to_wait_event() in ___wait_event() * tag 'vfs-6.15-rc1.pipe' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/splice: Use pipe_buf() helper to retrieve pipe buffer fs/pipe: Use pipe_buf() helper to retrieve pipe buffer kernel/watch_queue: Use pipe_buf() to retrieve the pipe buffer fs/pipe: Limit the slots in pipe_resize_ring() wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() pipe: cache 2 pages instead of 1 pipe: drop an always true check in anon_pipe_write() pipe: change pipe_write() to never add a zero-sized buffer pipe: don't update {a,c,m}time for anonymous pipes pipe: introduce struct file_operations pipeanon_fops
2025-03-24Merge tag 'vfs-6.15-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Mount notifications The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes. Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount. The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id. This allows userspace to figure out exactly how the mount topology changed without having to generating diffs of /proc/<pid>/mountinfo in userspace. - Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api - Support detached mounts in overlayfs Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts which means userspace cannot user file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first. This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts. - Allow to retrieve idmappings with statmount Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows to read the idmapping from the mount. - Allow creating idmapped mounts from mounts that are already idmapped So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allow to pass struct mount_attr together with the open_tree_attr() system call allowing to solve these issues without complicating VFS lookup in any way. The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace. - Add a way to query statmount() for supported options Allow userspace to query which mount information can be retrieved through statmount(). - Allow superblock owners to force unmount * tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) umount: Allow superblock owners to force umount selftests: add tests for mount notification selinux: add FILE__WATCH_MOUNTNS samples/vfs: fix printf format string for size_t fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper statmount: add a new supported_mask field samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP selftests: add tests for using detached mount with overlayfs samples/vfs: check whether flag was raised statmount: allow to retrieve idmappings uidgid: add map_id_range_up() fs: allow detached mounts in clone_private_mount() selftests/overlayfs: test specifying layers as O_PATH file descriptors fs: support O_PATH fds with FSCONFIG_SET_FD vfs: add notifications for mount attach and detach fanotify: notify on mount attach and detach ...
2025-03-24Merge tag 'vfs-6.15-rc1.eventpoll' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs eventpoll updates from Christian Brauner: "This contains a few preparatory changes to eventpoll to allow io_uring to support epoll" * tag 'vfs-6.15-rc1.eventpoll' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: eventpoll: add epoll_sendevents() helper eventpoll: abstract out ep_try_send_events() helper eventpoll: abstract out parameter sanity checking
2025-03-24Merge tag 'vfs-6.15-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Add CONFIG_DEBUG_VFS infrastucture: - Catch invalid modes in open - Use the new debug macros in inode_set_cached_link() - Use debug-only asserts around fd allocation and install - Place f_ref to 3rd cache line in struct file to resolve false sharing Cleanups: - Start using anon_inode_getfile_fmode() helper in various places - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock - Add unlikely() to kcmp() - Remove legacy ->remount_fs method from ecryptfs after port to the new mount api - Remove invalidate_inodes() in favour of evict_inodes() - Simplify ep_busy_loopER by removing unused argument - Avoid mmap sem relocks when coredumping with many missing pages - Inline getname() - Inline new_inode_pseudo() and de-staticize alloc_inode() - Dodge an atomic in putname if ref == 1 - Consistently deref the files table with rcu_dereference_raw() - Dedup handling of struct filename init and refcounts bumps - Use wq_has_sleeper() in end_dir_add() - Drop the lock trip around I_NEW wake up in evict() - Load the ->i_sb pointer once in inode_sb_list_{add,del} - Predict not reaching the limit in alloc_empty_file() - Tidy up do_sys_openat2() with likely/unlikely - Call inode_sb_list_add() outside of inode hash lock - Sort out fd allocation vs dup2 race commentary - Turn page_offset() into a wrapper around folio_pos() - Remove locking in exportfs around ->get_parent() call - try_lookup_one_len() does not need any locks in autofs - Fix return type of several functions from long to int in open - Fix return type of several functions from long to int in ioctls Fixes: - Fix watch queue accounting mismatch" * tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: sort out fd allocation vs dup2 race commentary, take 2 fs: call inode_sb_list_add() outside of inode hash lock fs: tidy up do_sys_openat2() with likely/unlikely fs: predict not reaching the limit in alloc_empty_file() fs: load the ->i_sb pointer once in inode_sb_list_{add,del} fs: drop the lock trip around I_NEW wake up in evict() fs: use wq_has_sleeper() in end_dir_add() VFS/autofs: try_lookup_one_len() does not need any locks fs: dedup handling of struct filename init and refcounts bumps fs: consistently deref the files table with rcu_dereference_raw() exportfs: remove locking around ->get_parent() call. fs: use debug-only asserts around fd allocation and install fs: dodge an atomic in putname if ref == 1 vfs: Remove invalidate_inodes() ecryptfs: remove NULL remount_fs from super_operations watch_queue: fix pipe accounting mismatch fs: place f_ref to 3rd cache line in struct file to resolve false sharing epoll: simplify ep_busy_loop by removing always 0 argument fs: Turn page_offset() into a wrapper around folio_pos() kcmp: improve performance adding an unlikely hint to task comparisons ...
2025-03-24Merge tag 'vfs-6.15-rc1.mount.api' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API updates from Christian Brauner: "This converts the remaining pseudo filesystems to the new mount api. The sysv conversion is a bit gratuitous because we remove sysv in another pull request. But if we have to revert the removal we at least will have it converted to the new mount api already" * tag 'vfs-6.15-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sysv: convert sysv to use the new mount api vfs: remove some unused old mount api code devtmpfs: replace ->mount with ->get_tree in public instance vfs: Convert devpts to use the new mount API pstore: convert to the new mount API
2025-03-24MAINTAINERS: remove myself as reviewerDarrick J. Wong
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-24Linux 6.14v6.14Linus Torvalds
2025-03-22Merge tag 'i2c-for-6.14-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fix from Wolfram Sang: "Fix double free of irq in amd-mp2 driver" * tag 'i2c-for-6.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: amd-mp2: drop free_irq() of devm_request_irq() allocated irq
2025-03-22Merge tag 'perf-urgent-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf events fix from Ingo Molnar: "Fix an information leak regression in the AMD IBS PMU code" * tag 'perf-urgent-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/amd/ibs: Prevent leaking sensitive data to userspace
2025-03-22Merge tag 'keys-next-6.14-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull keys fix from Jarkko Sakkinen: "Fix potential use-after-free in key_put()" * tag 'keys-next-6.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: keys: Fix UAF in key_put()
2025-03-22Merge tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Just a single fix for the commit that went into your tree yesterday, which exposed an issue with not always clearing notifications. That could cause them to be used more than once" * tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux: io_uring/net: fix sendzc double notif flush
2025-03-22io_uring/net: fix sendzc double notif flushPavel Begunkov
refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 5823 at lib/refcount.c:28 refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28 RIP: 0010:refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28 Call Trace: <TASK> io_notif_flush io_uring/notif.h:40 [inline] io_send_zc_cleanup+0x121/0x170 io_uring/net.c:1222 io_clean_op+0x58c/0x9a0 io_uring/io_uring.c:406 io_free_batch_list io_uring/io_uring.c:1429 [inline] __io_submit_flush_completions+0xc16/0xd20 io_uring/io_uring.c:1470 io_submit_flush_completions io_uring/io_uring.h:159 [inline] Before the blamed commit, sendzc relied on io_req_msg_cleanup() to clear REQ_F_NEED_CLEANUP, so after the following snippet the request will never hit the core io_uring cleanup path. io_notif_flush(); io_req_msg_cleanup(); The easiest fix is to null the notification. io_send_zc_cleanup() can still be called after, but it's tolerated. Reported-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com Tested-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com Fixes: cc34d8330e036 ("io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1306007458b8891c88c4f20c966a17595f766b0.1742643795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-22keys: Fix UAF in key_put()David Howells
Once a key's reference count has been reduced to 0, the garbage collector thread may destroy it at any time and so key_put() is not allowed to touch the key after that point. The most key_put() is normally allowed to do is to touch key_gc_work as that's a static global variable. However, in an effort to speed up the reclamation of quota, this is now done in key_put() once the key's usage is reduced to 0 - but now the code is looking at the key after the deadline, which is forbidden. Fix this by using a flag to indicate that a key can be gc'd now rather than looking at the key's refcount in the garbage collector. Fixes: 9578e327b2b4 ("keys: update key quotas in key_put()") Reported-by: syzbot+6105ffc1ded71d194d6d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/673b6aec.050a0220.87769.004a.GAE@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: syzbot+6105ffc1ded71d194d6d@syzkaller.appspotmail.com Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-03-22tracing: Disable branch profiling in noinstr codeJosh Poimboeuf
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-22perf/amd/ibs: Prevent leaking sensitive data to userspaceNamhyung Kim
Although IBS "swfilt" can prevent leaking samples with kernel RIP to the userspace, there are few subtle cases where a 'data' address and/or a 'branch target' address can fall under kernel address range although RIP is from userspace. Prevent leaking kernel 'data' addresses by discarding such samples when {exclude_kernel=1,swfilt=1}. IBS can now be invoked by unprivileged user with the introduction of "swfilt". However, this creates a loophole in the interface where an unprivileged user can get physical address of the userspace virtual addresses through IBS register raw dump (PERF_SAMPLE_RAW). Prevent this as well. This upstream commit fixed the most obvious leak: 65a99264f5e5 perf/x86: Check data address for IBS software filter Follow that up with a more complete fix. Fixes: d29e744c7167 ("perf/x86: Relax privilege filter restriction on AMD IBS") Suggested-by: Matteo Rizzo <matteorizzo@google.com> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250321161251.1033-1-ravi.bangoria@amd.com
2025-03-21x86/hyperv: fix an indentation issue in mshyperv.hWei Liu
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503220640.hjiacW2C-lkp@intel.com/ Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-03-21Merge tag 'spi-fix-v6.14-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fix from Mark Brown: "This is a straightforward fix for a reference count leak in the rarely used SPI device mode functionality" * tag 'spi-fix-v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: Fix reference count leak in slave_show()
2025-03-21Merge tag 'regulator-fix-v6.14-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "More fixes than I'd like at this point, some of which is due to me cooking things in -next for a bit and resetting that cooking time as more fixes came in. - Christian Eggers fixed some race conditions with the dummy regulator not being available very early in boot due to the use of asynchronous probing, both the provider side (ensuring that it's availalbe) and consumer side (handling things if that goes wrong) are fixed - Ludvig Pärsson fixed some lockdep issues with the debugfs registration for regulators holding more locks than it really needs causing issues later when looking at the resulting debugfs.boot - Some device specific fixes for incorrect descriptions of the RTQ2208 from ChiYuan Huang" * tag 'regulator-fix-v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: rtq2208: Fix the LDO DVS capability regulator: rtq2208: Fix incorrect buck converter phase mapping regulator: check that dummy regulator has been probed before using it regulator: dummy: force synchronous probing regulator: core: Fix deadlock in create_regulator()
2025-03-21Merge tag 'pinctrl-v6.14-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fix from Linus Walleij: - A single patch for Spacemit K1 fixing up the Kconfig to not default to "y" * tag 'pinctrl-v6.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: spacemit: PINCTRL_SPACEMIT_K1 should not default to y unconditionally
2025-03-21x86/hyperv: Add comments about hv_vpset and var size hypercall input argsMichael Kelley
Current code varies in how the size of the variable size input header for hypercalls is calculated when the input contains struct hv_vpset. Surprisingly, this variation is correct, as different hypercalls make different choices for what portion of struct hv_vpset is treated as part of the variable size input header. The Hyper-V TLFS is silent on these details, but the behavior has been confirmed with Hyper-V developers. To avoid future confusion about these differences, add comments to struct hv_vpset, and to hypercall call sites with input that contains a struct hv_vpset. The comments describe the overall situation and the calculation that should be used at each particular call site. No functional change as only comments are updated. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20250318214919.958953-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250318214919.958953-1-mhklinux@outlook.com>
2025-03-21Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMsNuno Das Neves
Provide a set of IOCTLs for creating and managing child partitions when running as root partition on Hyper-V. The new driver is enabled via CONFIG_MSHV_ROOT. A brief overview of the interface: MSHV_CREATE_PARTITION is the entry point, returning a file descriptor representing a child partition. IOCTLs on this fd can be used to map memory, create VPs, etc. Creating a VP returns another file descriptor representing that VP which in turn has another set of corresponding IOCTLs for running the VP, getting/setting state, etc. MSHV_ROOT_HVCALL is a generic "passthrough" hypercall IOCTL which can be used for a number of partition or VP hypercalls. This is for hypercalls that do not affect any state in the kernel driver, such as getting and setting VP registers and partition properties, translating addresses, etc. It is "passthrough" because the binary input and output for the hypercall is only interpreted by the VMM - the kernel driver does nothing but insert the VP and partition id where necessary (which are always in the same place), and execute the hypercall. Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Co-developed-by: Jinank Jain <jinankjain@microsoft.com> Signed-off-by: Jinank Jain <jinankjain@microsoft.com> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Co-developed-by: Muminul Islam <muislam@microsoft.com> Signed-off-by: Muminul Islam <muislam@microsoft.com> Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com> Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com> Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Co-developed-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Link: https://lore.kernel.org/r/1741980536-3865-11-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1741980536-3865-11-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-21selftests/timers: Improve skew_consistency by testing with other clockidsJohn Stultz
Lei Chen reported a bug with CLOCK_MONOTONIC_COARSE having inconsistencies when NTP is adjusting the clock frequency. This has gone seemingly undetected for ~15 years, illustrating a clear gap in our testing. The skew_consistency test is intended to catch this sort of problem, but was focused on only evaluating CLOCK_MONOTONIC, and thus missed the problem on CLOCK_MONOTONIC_COARSE. So adjust the test to run with all clockids for 60 seconds each instead of 10 minutes with just CLOCK_MONOTONIC. Reported-by: Lei Chen <lei.chen@smartx.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250320200306.1712599-2-jstultz@google.com Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
2025-03-21timekeeping: Fix possible inconsistencies in _COARSE clockidsJohn Stultz
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time inconsistencies. Lei tracked down that this was being caused by the adjustment tk->tkr_mono.xtime_nsec -= offset; which is made to compensate for the unaccumulated cycles in offset when the multiplicator is adjusted forward, so that the non-_COARSE clockids don't see inconsistencies. However, the _COARSE clockid getter functions use the adjusted xtime_nsec value directly and do not compensate the negative offset via the clocksource delta multiplied with the new multiplicator. In that case the caller can observe time going backwards in consecutive calls. By design, this negative adjustment should be fine, because the logic run from timekeeping_adjust() is done after it accumulated approximately multiplicator * interval_cycles into xtime_nsec. The accumulated value is always larger then the mult_adj * offset value, which is subtracted from xtime_nsec. Both operations are done together under the tk_core.lock, so the net change to xtime_nsec is always always be positive. However, do_adjtimex() calls into timekeeping_advance() as well, to to apply the NTP frequency adjustment immediately. In this case, timekeeping_advance() does not return early when the offset is smaller then interval_cycles. In that case there is no time accumulated into xtime_nsec. But the subsequent call into timekeeping_adjust(), which modifies the multiplicator, subtracts from xtime_nsec to correct for the new multiplicator. Here because there was no accumulation, xtime_nsec becomes smaller than before, which opens a window up to the next accumulation, where the _COARSE clockid getters, which don't compensate for the offset, can observe the inconsistency. To fix this, rework the timekeeping_advance() logic so that when invoked from do_adjtimex(), the time is immediately forwarded to accumulate also the sub-interval portion into xtime. That means the remaining offset becomes zero and the subsequent multiplier adjustment therefore does not modify xtime_nsec. There is another related inconsistency. If xtime is forwarded due to the instantaneous multiplier adjustment, the NTP error, which was accumulated with the previous setting, becomes meaningless. Therefore clear the NTP error as well, after forwarding the clock for the instantaneous multiplier update. Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE") Reported-by: Lei Chen <lei.chen@smartx.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250320200306.1712599-1-jstultz@google.com Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
2025-03-21Merge tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Single fix heading to stable, fixing an issue with io_req_msg_cleanup() sometimes too eagerly clearing cleanup flags" * tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux: io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally
2025-03-21Merge tag 'perf-urgent-2025-03-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf events fixes from Ingo Molnar: "Two fixes: an RAPL PMU driver error handling fix, and an AMD IBS software filter fix" * tag 'perf-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Fix error handling in init_rapl_pmus() perf/x86: Check data address for IBS software filter
2025-03-21Merge tag 'sched-urgent-2025-03-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Revert a scheduler performance optimization that regressed other workloads" * tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
2025-03-21Merge tag 'i2c-host-fixes-6.14-rc8' of ↵Wolfram Sang
git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current i2c-host-fixes for v6.14-rc8 amd-mp2: fix double free of irq.
2025-03-21PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flagRoger Pau Monne
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both MSI and MSI-X entries globally, regardless of the IRQ chip they are using. Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over event channels, to prevent PCI code from attempting to toggle the maskbit, as it's Xen that controls the bit. However, the pci_msi_ignore_mask being global will affect devices that use MSI interrupts but are not routing those interrupts over event channels (not using the Xen pIRQ chip). One example is devices behind a VMD PCI bridge. In that scenario the VMD bridge configures MSI(-X) using the normal IRQ chip (the pIRQ one in the Xen case), and devices behind the bridge configure the MSI entries using indexes into the VMD bridge MSI table. The VMD bridge then demultiplexes such interrupts and delivers to the destination device(s). Having pci_msi_ignore_mask set in that scenario prevents (un)masking of MSI entries for devices behind the VMD bridge. Move the signaling of no entry masking into the MSI domain flags, as that allows setting it on a per-domain basis. Set it for the Xen MSI domain that uses the pIRQ chip, while leaving it unset for the rest of the cases. Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and with Xen dropping usage the variable is unneeded. This fixes using devices behind a VMD bridge on Xen PV hardware domains. Albeit Devices behind a VMD bridge are not known to Xen, that doesn't mean Linux cannot use them. By inhibiting the usage of VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask bodge devices behind a VMD bridge do work fine when use from a Linux Xen hardware domain. That's the whole point of the series. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Message-ID: <20250219092059.90850-4-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-21PCI: vmd: Disable MSI remapping bypass under XenRoger Pau Monne
MSI remapping bypass (directly configuring MSI entries for devices on the VMD bus) won't work under Xen, as Xen is not aware of devices in such bus, and hence cannot configure the entries using the pIRQ interface in the PV case, and in the PVH case traps won't be setup for MSI entries for such devices. Until Xen is aware of devices in the VMD bus prevent the VMD_FEAT_CAN_BYPASS_MSI_REMAP capability from being used when running as any kind of Xen guest. The MSI remapping bypass is an optional feature of VMD bridges, and hence when running under Xen it will be masked and devices will be forced to redirect its interrupts from the VMD bridge. That mode of operation must always be supported by VMD bridges and works when Xen is not aware of devices behind the VMD bridge. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Message-ID: <20250219092059.90850-3-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-21zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work ↵Ingo Molnar
around compiler segfault Due to pending percpu improvements in -next, GCC9 and GCC10 are crashing during the build with: lib/zstd/compress/huf_compress.c:1033:1: internal compiler error: Segmentation fault 1033 | { | ^ Please submit a full bug report, with preprocessed source if appropriate. See <file:///usr/share/doc/gcc-9/README.Bugs> for instructions. The DYNAMIC_BMI2 feature is a known-challenging feature of the ZSTD library, with an existing GCC quirk turning it off for GCC versions below 4.8. Increase the DYNAMIC_BMI2 version cutoff to GCC 11.0 - GCC 10.5 is the last version known to crash. Reported-by: Michael Kelley <mhklinux@outlook.com> Debugged-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: https://lore.kernel.org/r/SN6PR02MB415723FBCD79365E8D72CA5FD4D82@SN6PR02MB4157.namprd02.prod.outlook.com Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-21x86/asm: Make asm export of __ref_stack_chk_guard unconditionalArd Biesheuvel
Clang does not tolerate the use of non-TLS symbols for the per-CPU stack protector very well, and to work around this limitation, the symbol passed via the -mstack-protector-guard-symbol= option is never defined in C code, but only in the linker script, and it is exported from an assembly file. This is necessary because Clang will fail to generate the correct %GS based references in a compilation unit that includes a non-TLS definition of the guard symbol being used to store the stack cookie. This problem is only triggered by symbol definitions, not by declarations, but nonetheless, the declaration in <asm/asm-prototypes.h> is conditional on __GENKSYMS__ being #define'd, so that only genksyms will observe it, but for ordinary compilation, it will be invisible. This is causing problems with the genksyms alternative gendwarfksyms, which does not #define __GENKSYMS__, does not observe the symbol declaration, and therefore lacks the information it needs to version it. Adding the #define creates problems in other places, so that is not a straight-forward solution. So take the easy way out, and drop the conditional on __GENKSYMS__, as this is not really needed to begin with. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20250320213238.4451-2-ardb@kernel.org
2025-03-21xen/pci: Do not register devices with segments >= 0x10000Roger Pau Monne
The current hypercall interface for doing PCI device operations always uses a segment field that has a 16 bit width. However on Linux there are buses like VMD that hook up devices into the PCI hierarchy at segment >= 0x10000, after the maximum possible segment enumerated in ACPI. Attempting to register or manage those devices with Xen would result in errors at best, or overlaps with existing devices living on the truncated equivalent segment values. Note also that the VMD segment numbers are arbitrarily assigned by the OS, and hence there would need to be some negotiation between Xen and the OS to agree on how to enumerate VMD segments and devices behind them. Skip notifying Xen about those devices. Given how VMD bridges can multiplex interrupts on behalf of devices behind them there's no need for Xen to be aware of such devices for them to be usable by Linux. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Juergen Gross <jgross@suse.com> Message-ID: <20250219092059.90850-2-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-20Merge tag 'drm-fixes-2025-03-21' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "Just the usual spread of a bunch for amdgpu, and small changes to others. scheduler: - fix fence reference leak xe: - Fix for an error if exporting a dma-buf multiple time amdgpu: - Fix video caps limits on several asics - SMU 14.x fixes - GC 12 fixes - eDP fixes - DMUB fix amdkfd: - GC 12 trap handler fix - GC 7/8 queue validation fix radeon: - VCE IB parsing fix v3d: - fix job error handling bugs qaic: - fix two integer overflows host1x: - fix NULL domain handling" * tag 'drm-fixes-2025-03-21' of https://gitlab.freedesktop.org/drm/kernel: (21 commits) drm/xe: Fix exporting xe buffers multiple times gpu: host1x: Do not assume that a NULL domain means no DMA IOMMU drm/amdgpu/pm: Handle SCLK offset correctly in overdrive for smu 14.0.2 drm/amd/display: Fix incorrect fw_state address in dmub_srv drm/amd/display: Use HW lock mgr for PSR1 when only one eDP drm/amd/display: Fix message for support_edp0_on_dp1 drm/amdkfd: Fix user queue validation on Gfx7/8 drm/amdgpu: Restore uncached behaviour on GFX12 drm/amdgpu/gfx12: correct cleanup of 'me' field with gfx_v12_0_me_fini() drm/amdkfd: Fix instruction hazard in gfx12 trap handler drm/amdgpu/pm: wire up hwmon fan speed for smu 14.0.2 drm/amd/pm: add unique_id for gfx12 drm/amdgpu: Remove JPEG from vega and carrizo video caps drm/amdgpu: Fix JPEG video caps max size for navi1x and raven drm/amdgpu: Fix MPEG2, MPEG4 and VC1 video caps max size drm/radeon: fix uninitialized size issue in radeon_vce_cs_parse() accel/qaic: Fix integer overflow in qaic_validate_req() accel/qaic: Fix possible data corruption in BOs > 2G drm/v3d: Set job pointer to NULL when the job's fence has an error drm/v3d: Don't run jobs that have errors flagged in its fence ...
2025-03-20Merge tag 'v6.14-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fix from Steve French: "smb3 client reconnect fix" * tag 'v6.14-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: smb: client: don't retry IO on failed negprotos with soft mounts
2025-03-21Merge tag 'amd-drm-fixes-6.14-2025-03-20' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.14-2025-03-20: amdgpu: - Fix video caps limits on several asics - SMU 14.x fixes - GC 12 fixes - eDP fixes - DMUB fix amdkfd: - GC 12 trap handler fix - GC 7/8 queue validation fix radeon: - VCE IB parsing fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250320210800.1358992-1-alexander.deucher@amd.com
2025-03-21Merge tag 'drm-xe-fixes-2025-03-20' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Fix for an error if exporting a dma-buf multiple time (Tomasz) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/Z9xalLaCWsNbh0P0@fedora
2025-03-21Merge tag 'drm-misc-fixes-2025-03-20' of ↵Dave Airlie
ssh://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes A sched fence reference leak fix, two fence fixes for v3d, two overflow fixes for quaic, and a iommu handling fix for host1x. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250320-valiant-outstanding-nightingale-e9acae@houat
2025-03-20Merge tag 'dma-mapping-6.14-2025-03-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: - fix missing clear bdr in check_ram_in_range_map() (Baochen Qiang) * tag 'dma-mapping-6.14-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: fix missing clear bdr in check_ram_in_range_map()