summaryrefslogtreecommitdiff
path: root/fs/open.c
AgeCommit message (Collapse)Author
2024-01-08Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull vfs rw updates from Christian Brauner: "This contains updates from Amir for read-write backing file helpers for stacking filesystems such as overlayfs: - Fanotify is currently in the process of introducing pre content events. Roughly, a new permission event will be added indicating that it is safe to write to the file being accessed. These events are used by hierarchical storage managers to e.g., fill the content of files on first access. During that work we noticed that our current permission checking is inconsistent in rw_verify_area() and remap_verify_area(). Especially in the splice code permission checking is done multiple times. For example, one time for the whole range and then again for partial ranges inside the iterator. In addition, we mostly do permission checking before we call file_start_write() except for a few places where we call it after. For pre-content events we need such permission checking to be done before file_start_write(). So this is a nice reason to clean this all up. After this series, all permission checking is done before file_start_write(). As part of this cleanup we also massaged the splice code a bit. We got rid of a few helpers because we are alredy drowning in special read-write helpers. We also cleaned up the return types for splice helpers. - Introduce generic read-write helpers for backing files. This lifts some overlayfs code to common code so it can be used by the FUSE passthrough work coming in over the next cycles. Make Amir and Miklos the maintainers for this new subsystem of the vfs" * tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: fix __sb_write_started() kerneldoc formatting fs: factor out backing_file_mmap() helper fs: factor out backing_file_splice_{read,write}() helpers fs: factor out backing_file_{read,write}_iter() helpers fs: prepare for stackable filesystems backing file helpers fsnotify: optionally pass access range in file permission hooks fsnotify: assert that file_start_write() is not held in permission hooks fsnotify: split fsnotify_perm() into two hooks fs: use splice_copy_file_range() inline helper splice: return type ssize_t from all helpers fs: use do_splice_direct() for nfsd/ksmbd server-side-copy fs: move file_start_write() into direct_splice_actor() fs: fork splice_file_range() from do_splice_direct() fs: create {sb,file}_write_not_started() helpers fs: create file_write_started() helper fs: create __sb_write_started() helper fs: move kiocb_start_write() into vfs_iocb_iter_write() fs: move permission hook out of do_iter_read() fs: move permission hook out of do_iter_write() fs: move file_start_write() into vfs_iter_write() ...
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2023-12-23fs: prepare for stackable filesystems backing file helpersAmir Goldstein
In preparation for factoring out some backing file io helpers from overlayfs, move backing_file_open() into a new file fs/backing-file.c and header. Add a MAINTAINERS entry for stackable filesystems and add a Kconfig FS_STACK which stackable filesystems need to select. For now, the backing_file struct, the backing_file alloc/free functions and the backing_file_real_path() accessor remain internal to file_table.c. We may change that in the future. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-21Merge branch 'vfs.file'Christian Brauner
Bring in the changes to the file infrastructure for this cycle. Mostly cleanups and some performance tweaks. * file: remove __receive_fd() * file: stop exposing receive_fd_user() * fs: replace f_rcuhead with f_task_work * file: remove pointless wrapper * file: s/close_fd_get_file()/file_close_fd()/g * Improve __fget_files_rcu() code generation (and thus __fget_light()) * file: massage cleanup of files that failed to open Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-15cred: get rid of CONFIG_DEBUG_CREDENTIALSJens Axboe
This code is rarely (never?) enabled by distros, and it hasn't caught anything in decades. Let's kill off this legacy debug code. Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-12fsnotify: optionally pass access range in file permission hooksAmir Goldstein
In preparation for pre-content permission events with file access range, move fsnotify_file_perm() hook out of security_file_permission() and into the callers. Callers that have the access range information call the new hook fsnotify_file_area_perm() with the access range. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-6-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12file: s/close_fd_get_file()/file_close_fd()/gChristian Brauner
That really shouldn't have "get" in there as that implies we're bumping the reference count which we don't do at all. We used to but not anmore. Now we're just closing the fd and pick that file from the fdtable without bumping the reference count. Update the wrong documentation while at it. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-1-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: Clarify "non-RCY" in access_override_creds() commentBagas Sanjaya
The term is originally intended as a joke that stands for "non-racy". This trips new contributors who mistake it for RCU typo [1]. Replace the term with more-explicit wording. Link: https://lore.kernel.org/r/20231030-debatten-nachrangig-f58abcdac530@brauner/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20231031114728.41485-1-bagasdotme@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19fs: store real path instead of fake path in backing file f_pathAmir Goldstein
A backing file struct stores two path's, one "real" path that is referring to f_inode and one "fake" path, which should be displayed to users in /proc/<pid>/maps. There is a lot more potential code that needs to know the "real" path, then code that needs to know the "fake" path. Instead of code having to request the "real" path with file_real_path(), store the "real" path in f_path and require code that needs to know the "fake" path request it with file_user_path(). Replace the file_real_path() helper with a simple const accessor f_path(). After this change, file_dentry() is not expected to observe any files with overlayfs f_path and real f_inode, so the call to ->d_real() should not be needed. Leave the ->d_real() call for now and add an assertion in ovl_d_real() to catch if we made wrong assumptions. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19fs: get mnt_writers count for an open backing file's real pathAmir Goldstein
A writeable mapped backing file can perform writes to the real inode. Therefore, the real path mount must be kept writable so long as the writable map exists. This may not be strictly needed for ovelrayfs private upper mount, but it is correct to take the mnt_writers count in the vfs helper. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-11fs: rename __mnt_{want,drop}_write*() helpersAmir Goldstein
Before exporting these helpers to modules, make their names more meaningful. The names mnt_{get,put)_write_access*() were chosen, because they rhyme with the inode {get,put)_write_access() helpers, which have a very close meaning for the inode object. Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230817-anfechtbar-ruhelosigkeit-8c6cca8443fc@brauner/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230908132900.2983519-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-28Merge tag 'v6.6-vfs.fchmodat2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fchmodat2 system call from Christian Brauner: "This adds the fchmodat2() system call. It is a revised version of the fchmodat() system call, adding a missing flag argument. Support for both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included. Adding this system call revision has been a longstanding request but so far has always fallen through the cracks. While the kernel implementation of fchmodat() does not have a flag argument the libc provided POSIX-compliant fchmodat(3) version does. Both glibc and musl have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW (see [1] and [2]). The workaround is brittle because it relies not just on O_PATH and O_NOFOLLOW semantics and procfs magic links but also on our rather inconsistent symlink semantics. This gives userspace a proper fchmodat2() system call that libcs can use to properly implement fchmodat(3) and allows them to get rid of their hacks. In this case it will immediately benefit them as the current workaround is already defunct because of aformentioned inconsistencies. In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use AT_EMPTY_PATH with fchmodat2(). This is already possible with fchownat() so there's no reason to not also support it for fchmodat2(). The implementation is simple and comes with selftests. Implementation of the system call and wiring up the system call are done as separate patches even though they could arguably be one patch. But in case there are merge conflicts from other system call additions it can be beneficial to have separate patches" Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2] * tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: fchmodat2: remove duplicate unneeded defines fchmodat2: add support for AT_EMPTY_PATH selftests: Add fchmodat2 selftest arch: Register fchmodat2, usually as syscall 452 fs: Add fchmodat2() Non-functional cleanup of a "__user * filename"
2023-08-28Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
2023-08-19fs: Fix kernel-doc warningsMatthew Wilcox (Oracle)
These have a variety of causes and a corresponding variety of solutions. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Message-Id: <20230818200824.2720007-1-willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-08fs: use __fput_sync in close(2)Linus Torvalds
close(2) is a special case which guarantees a shallow kernel stack, making delegation to task_work machinery unnecessary. Said delegation is problematic as it involves atomic ops and interrupt masking trips, none of which are cheap on x86-64. Forcing close(2) to do it looks like an oversight in the original work. Moreover presence of CONFIG_RSEQ adds an additional overhead as fput() -> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the thread returning to userspace land in resume_user_mode_work(), where rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for the thread (and it is by default with contemporary glibc). Sample result when benchmarking open1_processes -t 1 from will-it-scale (that's an open + close loop) + tmpfs on /tmp, running on the Sapphire Rapid CPU (ops/s): stock+RSEQ: 1329857 stock-RSEQ: 1421667 (+7%) patched: 1523521 (+14.5% / +7%) (with / without rseq) Patched result is the same regardless of rseq as the codepath is avoided. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-06open: make RESOLVE_CACHED correctly test for O_TMPFILEAleksa Sarai
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old fast-path check for RESOLVE_CACHED would reject all users passing O_DIRECTORY with -EAGAIN, when in fact the intended test was to check for __O_TMPFILE. Cc: stable@vger.kernel.org # v5.12+ Fixes: 99668f618062 ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-28fchmodat2: add support for AT_EMPTY_PATHAleksa Sarai
This allows userspace to avoid going through /proc/self/fd when dealing with all types of file descriptors for chmod(), and makes fchmodat2() a proper superset of all other chmod syscalls. The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is that fchmod() doesn't operate on O_PATH file descriptors by design. To quote open(2): > O_PATH (since Linux 2.6.39) > [...] > The file itself is not opened, and other file operations (e.g., > read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2), > mmap(2)) fail with the error EBADF. However, procfs has allowed userspace to do this operation ever since the introduction of O_PATH through magic-links, so adding this feature is only an improvement for programs that have to mess around with /proc/self/fd/$n today to get this behaviour. In addition, fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and allows chown() operations directly on O_PATH descriptors. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Acked-by: Alexey Gladkov <legion@kernel.org> Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-27fs: Add fchmodat2()Alexey Gladkov
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2]. There doesn't appear to be a good userspace workaround for this issue but the implementation in the kernel is pretty straight-forward. The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, unlike existing fchmodat. [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 Co-developed-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Message-Id: <f2a846ef495943c5d101011eebcf01179d0c7b61.1689092120.git.legion@kernel.org> [brauner: pre reviews, do flag conversion in do_fchmodat() directly] Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-29Merge tag 'fsnotify_for_v6.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - Support for fanotify events returning file handles for filesystems not exportable via NFS - Improved error handling exportfs functions - Add missing FS_OPEN events when unusual open helpers are used * tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: move fsnotify_open() hook into do_dentry_open() exportfs: check for error return value from exportfs_encode_*() fanotify: support reporting non-decodeable file handles exportfs: allow exporting non-decodeable file handles to userspace exportfs: add explicit flag to request non-decodeable file handles exportfs: change connectable argument to bit flags
2023-06-26Merge tag 'v6.5/vfs.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file handling updates from Christian Brauner: "This contains Amir's work to fix a long-standing problem where an unprivileged overlayfs mount can be used to avoid fanotify permission events that were requested for an inode or superblock on the underlying filesystem. Some background about files opened in overlayfs. If a file is opened in overlayfs @file->f_path will refer to a "fake" path. What this means is that while @file->f_inode will refer to inode of the underlying layer, @file->f_path refers to an overlayfs {dentry,vfsmount} pair. The reasons for doing this are out of scope here but it is the reason why the vfs has been providing the open_with_fake_path() helper for overlayfs for very long time now. So nothing new here. This is for sure not very elegant and everyone including the overlayfs maintainers agree. Improving this significantly would involve more fragile and potentially rather invasive changes. In various codepaths access to the path of the underlying filesystem is needed for such hybrid file. The best example is fsnotify where this becomes security relevant. Passing the overlayfs @file->f_path->dentry will cause fsnotify to skip generating fsnotify events registered on the underlying inode or superblock. To fix this we extend the vfs provided open_with_fake_path() concept for overlayfs to create a backing file container that holds the real path and to expose a helper that can be used by relevant callers to get access to the path of the underlying filesystem through the new file_real_path() helper. This pattern is similar to what we do in d_real() and d_real_inode(). The first beneficiary is fsnotify and fixes the security sensitive problem mentioned above. There's a couple of nice cleanups included as well. Over time, the old open_with_fake_path() helper added specifically for overlayfs a long time ago started to get used in other places such as cachefiles. Even though cachefiles have nothing to do with hybrid files. The only reason cachefiles used that concept was that files opened with open_with_fake_path() aren't charged against the caller's open file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that both overlayfs and cachefiles need to ensure to not overcharge the caller for their internal open calls. So this work disentangles FMODE_NOACCOUNT use cases and backing file use-cases by adding the FMODE_BACKING flag which indicates that the file can be used to retrieve the backing file of another filesystem. (Fyi, Jens will be sending you a really nice cleanup from Christoph that gets rid of 3 FMODE_* flags otherwise this would be the last fmode_t bit we'd be using.) So now overlayfs becomes the sole user of the renamed open_with_fake_path() helper which is now named backing_file_open(). For internal kernel users such as cachefiles that are only interested in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new kernel_file_open() helper which opens a file without being charged against the caller's open file limit. All new helpers are properly documented and clearly annotated to mention their special uses. We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly distinguish it from vfs_tmpfile() and align it the other kernel_*() internal helpers" * tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ovl: enable fsnotify events on underlying real files fs: use backing_file container for internal files with "fake" f_path fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers fs: use a helper for opening kernel internal files fs: rename {vfs,kernel}_tmpfile_open()
2023-06-19fs: use backing_file container for internal files with "fake" f_pathAmir Goldstein
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode. Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path. backing_file_real_path() can be used to access the stored real path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use a helper for opening kernel internal filesAmir Goldstein
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile without accounting for nr_files. cachefiles uses open_with_fake_path() for the same reason without the need for a fake path. Fork open_with_fake_path() to kernel_file_open() which only does the noaccount part and use it in cachefiles. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-12fsnotify: move fsnotify_open() hook into do_dentry_open()Amir Goldstein
fsnotify_open() hook is called only from high level system calls context and not called for the very many helpers to open files. This may makes sense for many of the special file open cases, but it is inconsistent with fsnotify_close() hook that is called for every last fput() of on a file object with FMODE_OPENED. As a result, it is possible to observe ACCESS, MODIFY and CLOSE events without ever observing an OPEN event. Fix this inconsistency by replacing all the fsnotify_open() hooks with a single hook inside do_dentry_open(). If there are special cases that would like to opt-out of the possible overhead of fsnotify() call in fsnotify_open(), they would probably also want to avoid the overhead of fsnotify() call in the rest of the fsnotify hooks, so they should be opening that file with the __FMODE_NONOTIFY flag. However, in the majority of those cases, the s_fsnotify_connectors optimization in fsnotify_parent() would be sufficient to avoid the overhead of fsnotify() call anyway. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
2023-05-15fs: fix incorrect fmode_t castsMin-Hua Chen
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes the following sparce warnings: fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to integer fs/overlayfs/file.c:128:13: sparse: warning: restricted fmode_t degrades to integer fs/open.c:1159:21: sparse: warning: restricted fmode_t degrades to integer Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Message-Id: <20230502232210.119063-1-minhuadotchen@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15fs/open.c: Fix W=1 kernel doc warningsAnuradha Weeraman
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid': warning: Function parameter or member 'attr' not described - Fix warning by removing kernel-doc for these as they are static inline functions and not required to be exposed via kernel-doc. fs/open.c: warning: Excess function parameter 'opened' description in 'finish_open' warning: Excess function parameter 'cred' description in 'vfs_open' - Fix by removing the parameters from the kernel-doc as they are no longer required by the function. Signed-off-by: Anuradha Weeraman <anuradha@debian.org> Message-Id: <20230506182928.384105-1-anuradha@debian.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-03-22open: return EINVAL for O_DIRECTORY | O_CREATChristian Brauner
After a couple of years and multiple LTS releases we received a report that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7. On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL had the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: create regular file * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: create regular file * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL have the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory This is a fairly substantial semantic change that userspace didn't notice until Pedro took the time to deliberately figure out corner cases. Since no one noticed this breakage we can somewhat safely assume that O_DIRECTORY | O_CREAT combinations are likely unused. The v5.7 breakage is especially weird because while ENOTDIR is returned indicating failure a regular file is actually created. This doesn't make a lot of sense. Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does. The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want. For some examples see the code examples in [1] to [3] and the discussion in [4]. There are various ways to address this issue. The lazy/simple option would be to restore the pre-v5.7 behavior and to just live with that bug forever. But since there's a real chance that the O_DIRECTORY | O_CREAT quirk isn't relied upon we should try to get away with murder(ing bad semantics) first. If we need to Frankenstein pre-v5.7 behavior later so be it. So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT combinations. In addition to cleaning up the old bug this also opens up the possiblity to make that flag combination do something more intuitive in the future. Starting with this commit the following semantics apply: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory One additional note, O_TMPFILE is implemented as: #define __O_TMPFILE 020000000 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) For older kernels it was important to return an explicit error when O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't specified. Since O_DIRECTORY | O_CREAT could be used to create a regular allowing that combination together with __O_TMPFILE would've meant that false positives were possible, i.e., that a regular file was created instead of a O_TMPFILE. This could've been used to trick userspace into thinking it operated on a O_TMPFILE when it wasn't. Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE) can be dropped. Instead we can simply check verify that O_DIRECTORY is raised via if (!(flags & O_DIRECTORY)) and explain this in two comments. As Aleksa pointed out O_PATH is unaffected by this change since it always returned EINVAL if O_CREAT was specified - with or without O_DIRECTORY. Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1] Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2] Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3] Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4] Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-02-27vfs: avoid duplicating creds in faccessat if possibleMateusz Guzik
access(2) remains commonly used, for example on exec: access("/etc/ld.so.preload", R_OK) or when running gcc: strace -c gcc empty.c % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 0.00 0.000000 0 42 26 access It falls down to do_faccessat without the AT_EACCESS flag, which in turn results in allocation of new creds in order to modify fsuid/fsgid and caps. This is a very expensive process single-threaded and most notably multi-threaded, with numerous structures getting refed and unrefed on imminent new cred destruction. Turns out for typical consumers the resulting creds would be identical and this can be checked upfront, avoiding the hard work. An access benchmark plugged into will-it-scale running on Cascade Lake shows: test proc before after access1 1 1310582 2908735 (+121%) # distinct files access1 24 4716491 63822173 (+1353%) # distinct files access2 24 2378041 5370335 (+125%) # same file The above benchmarks are not integrated into will-it-scale, but can be found in a pull request: https://github.com/antonblanchard/will-it-scale/pull/36/files Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-20Merge tag 'fs.v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs hardening update from Christian Brauner: "Jan pointed out that during shutdown both filp_close() and super block destruction will use basic printk logging when bugs are detected. This causes issues in a few scenarios: - Tools like syzkaller cannot figure out that the logged message indicates a bug. - Users that explicitly opt in to have the kernel bug on data corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see the kernel crash when they did actually select that option. - When there are busy inodes after the superblock is shut down later access to such a busy inodes walks through freed memory. It would be better to cleanly crash instead. All of this can be addressed by using the already existing CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are detected. Its logging improvement is useful for all users. Otherwise this only has a meaningful behavioral effect when users do select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward compatible for regular users" * tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-01-27fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detectedJann Horn
Currently, filp_close() and generic_shutdown_super() use printk() to log messages when bugs are detected. This is problematic because infrastructure like syzkaller has no idea that this message indicates a bug. In addition, some people explicitly want their kernels to BUG() when kernel data corruption has been detected (CONFIG_BUG_ON_DATA_CORRUPTION). And finally, when generic_shutdown_super() detects remaining inodes on a system without CONFIG_BUG_ON_DATA_CORRUPTION, it would be nice if later accesses to a busy inode would at least crash somewhat cleanly rather than walking through freed memory. To address all three, use CHECK_DATA_CORRUPTION() when kernel bugs are detected. Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port vfs{g,u}id helpers to mnt_idmapChristian Brauner
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port privilege checking helpers to mnt_idmapChristian Brauner
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->permission() to pass mnt_idmapChristian Brauner
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-18fs: port vfs_*() helpers to struct mnt_idmapChristian Brauner
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-11fs: remove locks_inodeJeff Layton
locks_inode was turned into a wrapper around file_inode in de2a4a501e71 (Partially revert "locks: fix file locking on overlayfs"). Finish replacing locks_inode invocations everywhere with file_inode. Acked-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-11filelock: move file locking definitions to separate header fileJeff Layton
The file locking definitions have lived in fs.h since the dawn of time, but they are only used by a small subset of the source files that include it. Move the file locking definitions to a new header file, and add the appropriate #include directives to the source files that need them. By doing this we trim down fs.h a bit and limit the amount of rebuilding that has to be done when we make changes to the file locking APIs. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Steve French <stfrench@microsoft.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-12-13Merge tag 'landlock-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux Pull landlock updates from Mickaël Salaün: "This adds file truncation support to Landlock, contributed by Günther Noack. As described by Günther [1], the goal of these patches is to work towards a more complete coverage of file system operations that are restrictable with Landlock. The known set of currently unsupported file system operations in Landlock is described at [2]. Out of the operations listed there, truncate is the only one that modifies file contents, so these patches should make it possible to prevent the direct modification of file contents with Landlock. The new LANDLOCK_ACCESS_FS_TRUNCATE access right covers both the truncate(2) and ftruncate(2) families of syscalls, as well as open(2) with the O_TRUNC flag. This includes usages of creat() in the case where existing regular files are overwritten. Additionally, this introduces a new Landlock security blob associated with opened files, to track the available Landlock access rights at the time of opening the file. This is in line with Unix's general approach of checking the read and write permissions during open(), and associating this previously checked authorization with the opened file. An ongoing patch documents this use case [3]. In order to treat truncate(2) and ftruncate(2) calls differently in an LSM hook, we split apart the existing security_path_truncate hook into security_path_truncate (for truncation by path) and security_file_truncate (for truncation of previously opened files)" Link: https://lore.kernel.org/r/20221018182216.301684-1-gnoack3000@gmail.com [1] Link: https://www.kernel.org/doc/html/v6.1/userspace-api/landlock.html#filesystem-flags [2] Link: https://lore.kernel.org/r/20221209193813.972012-1-mic@digikod.net [3] * tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: samples/landlock: Document best-effort approach for LANDLOCK_ACCESS_FS_REFER landlock: Document Landlock's file truncation support samples/landlock: Extend sample tool to support LANDLOCK_ACCESS_FS_TRUNCATE selftests/landlock: Test ftruncate on FDs created by memfd_create(2) selftests/landlock: Test FD passing from restricted to unrestricted processes selftests/landlock: Locally define __maybe_unused selftests/landlock: Test open() and ftruncate() in multiple scenarios selftests/landlock: Test file truncation support landlock: Support file truncation landlock: Document init_layer_masks() helper landlock: Refactor check_access_path_dual() into is_access_to_paths_allowed() security: Create file_truncate hook from path_truncate hook
2022-10-19security: Create file_truncate hook from path_truncate hookGünther Noack
Like path_truncate, the file_truncate hook also restricts file truncation, but is called in the cases where truncation is attempted on an already-opened file. This is required in a subsequent commit to handle ftruncate() operations differently to truncate() operations. Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20221018182216.301684-2-gnoack3000@gmail.com Signed-off-by: Mickaël Salaün <mic@digikod.net>
2022-10-18attr: use consistent sgid stripping checksChristian Brauner
Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-06Merge tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs file updates from Al Viro: "struct file-related stuff" * tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dma_buf_getfile(): don't bother with ->f_flags reassignments Change calling conventions for filldir_t locks: fix TOCTOU race when granting write lease
2022-09-20open: always initialize ownership fieldsTetsuo Handa
Beginning of the merge window we introduced the vfs{g,u}id_t types in b27c82e12965 ("attr: port attribute changes to new types") and changed various codepaths over including chown_common(). During that change we forgot to account for the case were the passed ownership value is -1. In this case the ownership fields in struct iattr aren't initialized but we rely on them being initialized by the time we generate the ownership to pass down to the LSMs. All the major LSMs don't care about the ownership values at all. Only Tomoyo uses them and so it took a while for syzbot to unearth this issue. Fix this by initializing the ownership fields and do it within the retry_deleg block. While notify_change() doesn't alter the ownership fields currently we shouldn't rely on it. Since no kernel has been released with these changes this does not needed to be backported to any stable kernels. [Christian Brauner (Microsoft) <brauner@kernel.org>] * rewrote commit message * use INVALID_VFS{G,U}ID macros Fixes: b27c82e12965 ("attr: port attribute changes to new types") # mainline only Reported-and-tested-by: syzbot+541e21dcc32c4046cba9@syzkaller.appspotmail.com Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-08-16locks: fix TOCTOU race when granting write leaseAmir Goldstein
Thread A trying to acquire a write lease checks the value of i_readcount and i_writecount in check_conflicting_open() to verify that its own fd is the only fd referencing the file. Thread B trying to open the file for read will call break_lease() in do_dentry_open() before incrementing i_readcount, which leaves a small window where thread A can acquire the write lease and then thread B completes the open of the file for read without breaking the write lease that was acquired by thread A. Fix this race by incrementing i_readcount before checking for existing leases, same as the case with i_writecount. Use a helper put_file_access() to decrement i_readcount or i_writecount in do_dentry_open() and __fput(). Fixes: 387e3746d01c ("locks: eliminate false positive conflicts for write lease") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-03Merge tag 'pull-work.iov_iter-base' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's the explanation for the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()
2022-08-03Merge tag 'pull-work.lseek' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs lseek updates from Al Viro: "Jason's lseek series. Saner handling of 'lseek should fail with ESPIPE' - this gets rid of the magical no_llseek thing and makes checks consistent. In particular, the ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT)" * tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove no_llseek fs: check FMODE_LSEEK to control internal pipe splicing vfio: do not set FMODE_LSEEK flag dma-buf: remove useless FMODE_LSEEK flag fs: do not compare against ->llseek fs: clear or set FMODE_LSEEK based on llseek function
2022-07-16fs: remove no_llseekJason A. Donenfeld
Now that all callers of ->llseek are going through vfs_llseek(), we don't gain anything by keeping no_llseek around. Nothing actually calls it and setting ->llseek to no_lseek is completely equivalent to leaving it NULL. Longer term (== by the end of merge window) we want to remove all such intializations. To simplify the merge window this commit does *not* touch initializers - it only defines no_llseek as NULL (and simplifies the tests on file opening). At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-16fs: clear or set FMODE_LSEEK based on llseek functionJason A. Donenfeld
Pipe-like behaviour on llseek(2) (i.e. unconditionally failing with -ESPIPE) can be expresses in 3 ways: 1) ->llseek set to NULL in file_operations 2) ->llseek set to no_llseek in file_operations 3) FMODE_LSEEK *not* set in ->f_mode. Enforce (3) in cases (1) and (2); that will allow to simplify the checks and eventually get rid of no_llseek boilerplate. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-26attr: port attribute changes to new typesChristian Brauner
Now that we introduced new infrastructure to increase the type safety for filesystems supporting idmapped mounts port the first part of the vfs over to them. This ports the attribute changes codepaths to rely on the new better helpers using a dedicated type. Before this change we used to take a shortcut and place the actual values that would be written to inode->i_{g,u}id into struct iattr. This had the advantage that we moved idmappings mostly out of the picture early on but it made reasoning about changes more difficult than it should be. The filesystem was never explicitly told that it dealt with an idmapped mount. The transition to the value that needed to be stored in inode->i_{g,u}id appeared way too early and increased the probability of bugs in various codepaths. We know place the same value in struct iattr no matter if this is an idmapped mount or not. The vfs will only deal with type safe vfs{g,u}id_t. This makes it massively safer to perform permission checks as the type will tell us what checks we need to perform and what helpers we need to use. Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to inode->i_{g,u}id since they are different types. Instead they need to use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the vfs{g,u}id into the filesystem. The other nice effect is that filesystems like overlayfs don't need to care about idmappings explicitly anymore and can simply set up struct iattr accordingly directly. Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1] Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org Reviewed-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-10keep iocb_flags() result cached in struct fileAl Viro
* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal opens, alloc_file() for pipe()/socket()/etc.) * update when handling F_SETFL * keep in a new field - file->f_iocb_flags; since that thing is needed only before the refcount reaches zero, we can put it into the same anon union where ->f_rcuhead and ->f_llist live - those are used only after refcount reaches zero. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-31Merge tag 'riscv-for-linus-5.19-mw0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for the Svpbmt extension, which allows memory attributes to be encoded in pages - Support for the Allwinner D1's implementation of page-based memory attributes - Support for running rv32 binaries on rv64 systems, via the compat subsystem - Support for kexec_file() - Support for the new generic ticket-based spinlocks, which allows us to also move to qrwlock. These should have already gone in through the asm-geneic tree as well - A handful of cleanups and fixes, include some larger ones around atomics and XIP * tag 'riscv-for-linus-5.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits) RISC-V: Prepare dropping week attribute from arch_kexec_apply_relocations[_add] riscv: compat: Using seperated vdso_maps for compat_vdso_info RISC-V: Fix the XIP build RISC-V: Split out the XIP fixups into their own file RISC-V: ignore xipImage RISC-V: Avoid empty create_*_mapping definitions riscv: Don't output a bogus mmu-type on a no MMU kernel riscv: atomic: Add custom conditional atomic operation implementation riscv: atomic: Optimize dec_if_positive functions riscv: atomic: Cleanup unnecessary definition RISC-V: Load purgatory in kexec_file RISC-V: Add purgatory RISC-V: Support for kexec_file on panic RISC-V: Add kexec_file support RISC-V: use memcpy for kexec_file mode kexec_file: Fix kexec_file.c build error for riscv platform riscv: compat: Add COMPAT Kbuild skeletal support riscv: compat: ptrace: Add compat_arch_ptrace implement riscv: compat: signal: Add rt_frame implementation riscv: add memory-type errata for T-Head ...
2022-05-26Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "We introduce 'courteous server' in this release. Previously NFSD would purge open and lock state for an unresponsive client after one lease period (typically 90 seconds). Now, after one lease period, another client can open and lock those files and the unresponsive client's lease is purged; otherwise if the unresponsive client's open and lock state is uncontended, the server retains that open and lock state for up to 24 hours, allowing the client's workload to resume after a lengthy network partition. A longstanding issue with NFSv4 file creation is also addressed. Previously a file creation can fail internally, returning an error to the client, but leave the newly created file in place as an artifact. The file creation code path has been reorganized so that internal failures and race conditions are less likely to result in an unwanted file creation. A fault injector has been added to help exercise paths that are run during kernel metadata cache invalidation. These caches contain information maintained by user space about exported filesystems. Many of our test workloads do not trigger cache invalidation. There is one patch that is needed to support PREEMPT_RT and a fix for an ancient 'sleep while spin-locked' splat that seems to have become easier to hit since v5.18-rc3" * tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits) NFSD: nfsd_file_put() can sleep NFSD: Add documenting comment for nfsd4_release_lockowner() NFSD: Modernize nfsd4_release_lockowner() NFSD: Fix possible sleep during nfsd4_release_lockowner() nfsd: destroy percpu stats counters after reply cache shutdown nfsd: Fix null-ptr-deref in nfsd_fill_super() nfsd: Unregister the cld notifier when laundry_wq create failed SUNRPC: Use RMW bitops in single-threaded hot paths NFSD: Clean up the show_nf_flags() macro NFSD: Trace filecache opens NFSD: Move documenting comment for nfsd4_process_open2() NFSD: Fix whitespace NFSD: Remove dprintk call sites from tail of nfsd4_open() NFSD: Instantiate a struct file when creating a regular NFSv4 file NFSD: Clean up nfsd_open_verified() NFSD: Remove do_nfsd_create() NFSD: Refactor NFSv4 OPEN(CREATE) NFSD: Refactor NFSv3 CREATE NFSD: Refactor nfsd_create_setattr() NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create() ...