summaryrefslogtreecommitdiff
path: root/fs/fuse/file.c
AgeCommit message (Collapse)Author
2024-04-15fuse: fix wrong ff->iomode state changes from parallel dio writeAmir Goldstein
There is a confusion with fuse_file_uncached_io_{start,end} interface. These helpers do two things when called from passthrough open()/release(): 1. Take/drop negative refcount of fi->iocachectr (inode uncached io mode) 2. State change ff->iomode IOM_NONE <-> IOM_UNCACHED (file uncached open) The calls from parallel dio write path need to take a reference on fi->iocachectr, but they should not be changing ff->iomode state, because in this case, the fi->iocachectr reference does not stick around until file release(). Factor out helpers fuse_inode_uncached_io_{start,end}, to be used from parallel dio write path and rename fuse_file_*cached_io_{start,end} helpers to fuse_file_*cached_io_{open,release} to clarify the difference. Fixes: 205c1d802683 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-15Merge tag 'fuse-update-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add passthrough mode for regular file I/O. This allows performing read and write (also via memory maps) on a backing file without incurring the overhead of roundtrips to userspace. For now this is only allowed to privileged servers, but this limitation will go away in the future (Amir Goldstein) - Fix interaction of direct I/O mode with memory maps (Bernd Schubert) - Export filesystem tags through sysfs for virtiofs (Stefan Hajnoczi) - Allow resending queued requests for server crash recovery (Zhao Chen) - Misc fixes and cleanups * tag 'fuse-update-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (38 commits) fuse: get rid of ff->readdir.lock fuse: remove unneeded lock which protecting update of congestion_threshold fuse: Fix missing FOLL_PIN for direct-io fuse: remove an unnecessary if statement fuse: Track process write operations in both direct and writethrough modes fuse: Use the high bit of request ID for indicating resend requests fuse: Introduce a new notification type for resend pending requests fuse: add support for explicit export disabling fuse: __kuid_val/__kgid_val helpers in fuse_fill_attr_from_inode() fuse: fix typo for fuse_permission comment fuse: Convert fuse_writepage_locked to take a folio fuse: Remove fuse_writepage virtio_fs: remove duplicate check if queue is broken fuse: use FUSE_ROOT_ID in fuse_get_root_inode() fuse: don't unhash root fuse: fix root lookup with nonzero generation fuse: replace remaining make_bad_inode() with fuse_make_bad() virtiofs: drop __exit from virtio_fs_sysfs_exit() fuse: implement passthrough for mmap fuse: implement splice read/write passthrough ...
2024-03-06fuse: get rid of ff->readdir.lockMiklos Szeredi
The same protection is provided by file->f_pos_lock. Note, this relies on the fact that file->f_mode has FMODE_ATOMIC_POS. This flag is cleared by stream_open(), which would prevent locking of f_pos_lock. Prior to commit 7de64d521bf9 ("fuse: break up fuse_open_common()") FOPEN_STREAM on a directory would cause stream_open() to be called. After this commit this is not done anymore, so f_pos_lock will always be locked. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: Fix missing FOLL_PIN for direct-ioLei Huang
Our user space filesystem relies on fuse to provide POSIX interface. In our test, a known string is written into a file and the content is read back later to verify correct data returned. We observed wrong data returned in read buffer in rare cases although correct data are stored in our filesystem. Fuse kernel module calls iov_iter_get_pages2() to get the physical pages of the user-space read buffer passed in read(). The pages are not pinned to avoid page migration. When page migration occurs, the consequence are two-folds. 1) Applications do not receive correct data in read buffer. 2) fuse kernel writes data into a wrong place. Using iov_iter_extract_pages() to pin pages fixes the issue in our test. An auxiliary variable "struct page **pt_pages" is used in the patch to prepare the 2nd parameter for iov_iter_extract_pages() since iov_iter_get_pages2() uses a different type for the 2nd parameter. [SzM] add iov_iter_extract_will_pin(ii) and unpin only if true. Signed-off-by: Lei Huang <lei.huang@linux.intel.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: remove an unnecessary if statementJiachen Zhang
FUSE remote locking code paths never add any locking state to inode->i_flctx, so the locks_remove_posix() function called on file close will return without calling fuse_setlk(). Therefore, as the if statement to be removed in this commit will always be false, remove it for clearness. Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: Track process write operations in both direct and writethrough modesZhou Jifeng
Due to the fact that fuse does not count the write IO of processes in the direct and writethrough write modes, user processes cannot track write_bytes through the “/proc/[pid]/io” path. For example, the system tool iotop cannot count the write operations of the corresponding process. Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: Convert fuse_writepage_locked to take a folioMatthew Wilcox (Oracle)
The one remaining caller of fuse_writepage_locked() already has a folio, so convert this function entirely. Saves a few calls to compound_head() but no attempt is made to support large folios in this patch. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: Remove fuse_writepageMatthew Wilcox (Oracle)
The writepage operation is deprecated as it leads to worse performance under high memory pressure due to folios being written out in LRU order rather than sequentially within a file. Use filemap_migrate_folio() to support dirty folio migration instead of writepage. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: implement passthrough for mmapAmir Goldstein
An mmap request for a file open in passthrough mode, maps the memory directly to the backing file. An mmap of a file in direct io mode, usually uses cached mmap and puts the inode in caching io mode, which denies new passthrough opens of that inode, because caching io mode is conflicting with passthrough io mode. For the same reason, trying to mmap a direct io file, while there is a passthrough file open on the same inode will fail with -ENODEV. An mmap of a file in direct io mode, also needs to wait for parallel dio writes in-progress to complete. If a passthrough file is opened, while an mmap of another direct io file is waiting for parallel dio writes to complete, the wait is aborted and mmap fails with -ENODEV. A FUSE server that uses passthrough and direct io opens on the same inode that may also be mmaped, is advised to provide a backing fd also for the files that are open in direct io mode (i.e. use the flags combination FOPEN_DIRECT_IO | FOPEN_PASSTHROUGH), so that mmap will always use the backing file, even if read/write do not passthrough. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: implement splice read/write passthroughAmir Goldstein
This allows passing fstests generic/249 and generic/591. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: implement read/write passthroughAmir Goldstein
Use the backing file read/write helpers to implement read/write passthrough to a backing file. After read/write, we invalidate a/c/mtime/size attributes. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: implement open in passthrough modeAmir Goldstein
After getting a backing file id with FUSE_DEV_IOC_BACKING_OPEN ioctl, a FUSE server can reply to an OPEN request with flag FOPEN_PASSTHROUGH and the backing file id. The FUSE server should reuse the same backing file id for all the open replies of the same FUSE inode and open will fail (with -EIO) if a the server attempts to open the same inode with conflicting io modes or to setup passthrough to two different backing files for the same FUSE inode. Using the same backing file id for several different inodes is allowed. Opening a new file with FOPEN_DIRECT_IO for an inode that is already open for passthrough is allowed, but only if the FOPEN_PASSTHROUGH flag and correct backing file id are specified as well. The read/write IO of such files will not use passthrough operations to the backing file, but mmap, which does not support direct_io, will use the backing file insead of using the page cache as it always did. Even though all FUSE passthrough files of the same inode use the same backing file as a backing inode reference, each FUSE file opens a unique instance of a backing_file object to store the FUSE path that was used to open the inode and the open flags of the specific open file. The per-file, backing_file object is released along with the FUSE file. The inode associated fuse_backing object is released when the last FUSE passthrough file of that inode is released AND when the backing file id is closed by the server using the FUSE_DEV_IOC_BACKING_CLOSE ioctl. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: prepare for opening file in passthrough modeAmir Goldstein
In preparation for opening file in passthrough mode, store the fuse_open_out argument in ff->args to be passed into fuse_file_io_open() with the optional backing_id member. This will be used for setting up passthrough to backing file on open reply with FOPEN_PASSTHROUGH flag and a valid backing_id. Opening a file in passthrough mode may fail for several reasons, such as missing capability, conflicting open flags or inode in caching mode. Return EIO from fuse_file_io_open() in those cases. The combination of FOPEN_PASSTHROUGH and FOPEN_DIRECT_IO is allowed - it mean that read/write operations will go directly to the server, but mmap will be done to the backing file. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAPAmir Goldstein
Instead of denying caching mode on parallel dio open, deny caching open only while parallel dio are in-progress and wait for in-progress parallel dio writes before entering inode caching io mode. This allows executing parallel dio when inode is not in caching mode even if shared mmap is allowed, but no mmaps have been performed on the inode in question. An mmap on direct_io file now waits for all in-progress parallel dio writes to complete, so parallel dio writes together with FUSE_DIRECT_IO_ALLOW_MMAP is enabled by this commit. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: introduce inode io modesAmir Goldstein
The fuse inode io mode is determined by the mode of its open files/mmaps and parallel dio opens and expressed in the value of fi->iocachectr: > 0 - caching io: files open in caching mode or mmap on direct_io file < 0 - parallel dio: direct io mode with parallel dio writes enabled == 0 - direct io: no files open in caching mode and no files mmaped Note that iocachectr value of 0 might become positive or negative, while non-parallel dio is getting processed. direct_io mmap uses page cache, so first mmap will mark the file as ff->io_opened and increment fi->iocachectr to enter the caching io mode. If the server opens the file in caching mode while it is already open for parallel dio or vice versa the open fails. This allows executing parallel dio when inode is not in caching mode and no mmaps have been performed on the inode in question. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: prepare for failing open responseAmir Goldstein
In preparation for inode io modes, a server open response could fail due to conflicting inode io modes. Allow returning an error from fuse_finish_open() and handle the error in the callers. fuse_finish_open() is used as the callback of finish_open(), so that FMODE_OPENED will not be set if fuse_finish_open() fails. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: break up fuse_open_common()Amir Goldstein
fuse_open_common() has a lot of code relevant only for regular files and O_TRUNC in particular. Copy the little bit of remaining code into fuse_dir_open() and stop using this common helper for directory open. Also split out fuse_dir_finish_open() from fuse_finish_open() before we add inode io modes to fuse_finish_open(). Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: allocate ff->release_args only if release is neededAmir Goldstein
This removed the need to pass isdir argument to fuse_put_file(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: factor out helper fuse_truncate_update_attr()Amir Goldstein
fuse_finish_open() is called from fuse_open_common() and from fuse_create_open(). In the latter case, the O_TRUNC flag is always cleared in finish_open()m before calling into fuse_finish_open(). Move the bits that update attribute cache post O_TRUNC open into a helper and call this helper from fuse_open_common() directly. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: add fuse_dio_lock/unlock helper functionsBernd Schubert
So far this is just a helper to remove complex locking logic out of fuse_direct_write_iter. Especially needed by the next patch in the series to that adds the fuse inode cache IO mode and adds in even more locking complexity. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: create helper function if DIO write needs exclusive lockBernd Schubert
This makes the code a bit easier to read and allows to more easily add more conditions when an exclusive lock is needed. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: fix VM_MAYSHARE and direct_io_allow_mmapBernd Schubert
There were multiple issues with direct_io_allow_mmap: - fuse_link_write_file() was missing, resulting in warnings in fuse_write_file_get() and EIO from msync() - "vma->vm_ops = &fuse_file_vm_ops" was not set, but especially fuse_page_mkwrite is needed. The semantics of invalidate_inode_pages2() is so far not clearly defined in fuse_file_mmap. It dates back to commit 3121bfe76311 ("fuse: fix "direct_io" private mmap") Though, as direct_io_allow_mmap is a new feature, that was for MAP_PRIVATE only. As invalidate_inode_pages2() is calling into fuse_launder_folio() and writes out dirty pages, it should be safe to call invalidate_inode_pages2 for MAP_PRIVATE and MAP_SHARED as well. Cc: Hao Xu <howeyxu@tencent.com> Cc: stable@vger.kernel.org Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode") Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-05fuse: adapt to breakup of struct file_lockJeff Layton
Most of the existing APIs have remained the same, but subsystems that access file_lock fields directly need to reach into struct file_lock_core now. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-39-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05filelock: split common fields into struct file_lock_coreJeff Layton
In a future patch, we're going to split file leases into their own structure. Since a lot of the underlying machinery uses the same fields move those into a new file_lock_core, and embed that inside struct file_lock. For now, add some macros to ensure that we can continue to build while the conversion is in progress. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-08Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull vfs rw updates from Christian Brauner: "This contains updates from Amir for read-write backing file helpers for stacking filesystems such as overlayfs: - Fanotify is currently in the process of introducing pre content events. Roughly, a new permission event will be added indicating that it is safe to write to the file being accessed. These events are used by hierarchical storage managers to e.g., fill the content of files on first access. During that work we noticed that our current permission checking is inconsistent in rw_verify_area() and remap_verify_area(). Especially in the splice code permission checking is done multiple times. For example, one time for the whole range and then again for partial ranges inside the iterator. In addition, we mostly do permission checking before we call file_start_write() except for a few places where we call it after. For pre-content events we need such permission checking to be done before file_start_write(). So this is a nice reason to clean this all up. After this series, all permission checking is done before file_start_write(). As part of this cleanup we also massaged the splice code a bit. We got rid of a few helpers because we are alredy drowning in special read-write helpers. We also cleaned up the return types for splice helpers. - Introduce generic read-write helpers for backing files. This lifts some overlayfs code to common code so it can be used by the FUSE passthrough work coming in over the next cycles. Make Amir and Miklos the maintainers for this new subsystem of the vfs" * tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: fix __sb_write_started() kerneldoc formatting fs: factor out backing_file_mmap() helper fs: factor out backing_file_splice_{read,write}() helpers fs: factor out backing_file_{read,write}_iter() helpers fs: prepare for stackable filesystems backing file helpers fsnotify: optionally pass access range in file permission hooks fsnotify: assert that file_start_write() is not held in permission hooks fsnotify: split fsnotify_perm() into two hooks fs: use splice_copy_file_range() inline helper splice: return type ssize_t from all helpers fs: use do_splice_direct() for nfsd/ksmbd server-side-copy fs: move file_start_write() into direct_splice_actor() fs: fork splice_file_range() from do_splice_direct() fs: create {sb,file}_write_not_started() helpers fs: create file_write_started() helper fs: create __sb_write_started() helper fs: move kiocb_start_write() into vfs_iocb_iter_write() fs: move permission hook out of do_iter_read() fs: move permission hook out of do_iter_write() fs: move file_start_write() into vfs_iter_write() ...
2023-12-12fs: use splice_copy_file_range() inline helperAmir Goldstein
generic_copy_file_range() is just a wrapper around splice_file_range(), which caps the maximum copy length. The only caller of splice_file_range(), namely __ceph_copy_file_range() is already ready to cope with short copy. Move the length capping into splice_file_range() and replace the exported symbol generic_copy_file_range() with a simple inline helper. Suggested-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-fsdevel/20231204083849.GC32438@lst.de/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-04fuse: disable FOPEN_PARALLEL_DIRECT_WRITES with FUSE_DIRECT_IO_ALLOW_MMAPAmir Goldstein
The new fuse init flag FUSE_DIRECT_IO_ALLOW_MMAP breaks assumptions made by FOPEN_PARALLEL_DIRECT_WRITES and causes test generic/095 to hit BUG_ON(fi->writectr < 0) assertions in fuse_set_nowrite(): generic/095 5s ... kernel BUG at fs/fuse/dir.c:1756! ... ? fuse_set_nowrite+0x3d/0xdd ? do_raw_spin_unlock+0x88/0x8f ? _raw_spin_unlock+0x2d/0x43 ? fuse_range_is_writeback+0x71/0x84 fuse_sync_writes+0xf/0x19 fuse_direct_io+0x167/0x5bd fuse_direct_write_iter+0xf0/0x146 Auto disable FOPEN_PARALLEL_DIRECT_WRITES when server negotiated FUSE_DIRECT_IO_ALLOW_MMAP. Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode") Cc: <stable@vger.kernel.org> # v6.6 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-12-04fuse: Rename DIRECT_IO_RELAX to DIRECT_IO_ALLOW_MMAPTyler Fanelli
Although DIRECT_IO_RELAX's initial usage is to allow shared mmap, its description indicates a purpose of reducing memory footprint. This may imply that it could be further used to relax other DIRECT_IO operations in the future. Replace it with a flag DIRECT_IO_ALLOW_MMAP which does only one thing, allow shared mmap of DIRECT_IO files while still bypassing the cache on regular reads and writes. [Miklos] Also Keep DIRECT_IO_RELAX definition for backward compatibility. Signed-off-by: Tyler Fanelli <tfanelli@redhat.com> Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode") Cc: <stable@vger.kernel.org> # v6.6 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-16fuse: write back dirty pages before direct write in direct_io_relax modeHao Xu
In direct_io_relax mode, there can be shared mmaped files and thus dirty pages in its page cache. Therefore those dirty pages should be written back to backend before direct io to avoid data loss. Signed-off-by: Hao Xu <howeyxu@tencent.com> Reviewed-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-16fuse: add a new fuse init flag to relax restrictions in no cache modeHao Xu
FOPEN_DIRECT_IO is usually set by fuse daemon to indicate need of strong coherency, e.g. network filesystems. Thus shared mmap is disabled since it leverages page cache and may write to it, which may cause inconsistence. But FOPEN_DIRECT_IO can be used not for coherency but to reduce memory footprint as well, e.g. reduce guest memory usage with virtiofs. Therefore, add a new fuse init flag FUSE_DIRECT_IO_RELAX to relax restrictions in that mode, currently, it allows shared mmap. One thing to note is to make sure it doesn't break coherency in your use case. Signed-off-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-16fuse: invalidate page cache pages before direct writeHao Xu
In FOPEN_DIRECT_IO, page cache may still be there for a file since private mmap is allowed. Direct write should respect that and invalidate the corresponding pages so that page cache readers don't get stale data. Signed-off-by: Hao Xu <howeyxu@tencent.com> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-14Revert "fuse: in fuse_flush only wait if someone wants the return code"Miklos Szeredi
This reverts commit 5a8bee63b10f6f2f52f6d22e109a4a147409842a. Jürg Billeter reports the following regression: Since v6.3-rc1 commit 5a8bee63b1 ("fuse: in fuse_flush only wait if someone wants the return code") `fput()` is called asynchronously if a file is closed as part of a process exiting, i.e., if there was no explicit `close()` before exit. If the file was open for writing, also `put_write_access()` is called asynchronously as part of the async `fput()`. If that newly written file is an executable, attempting to `execve()` the new file can fail with `ETXTBSY` if it's called after the writer process exited but before the async `fput()` has run. Reported-and-tested-by: "Jürg Billeter" <j@bitron.ch> Cc: <stable@vger.kernel.org> # v6.3 Link: https://lore.kernel.org/all/4f66cded234462964899f2a661750d6798a57ec0.camel@bitron.ch/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-09fuse: use direct_write_fallbackChristoph Hellwig
Use the generic direct_write_fallback helper instead of duplicating the logic. Link: https://lkml.kernel.org/r/20230601145904.1385409-13-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09fuse: drop redundant arguments to fuse_perform_writeChristoph Hellwig
pos is always equal to iocb->ki_pos, and mapping is always equal to iocb->ki_filp->f_mapping. Link: https://lkml.kernel.org/r/20230601145904.1385409-12-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09fuse: update ki_pos in fuse_perform_writeChristoph Hellwig
Both callers of fuse_perform_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-11-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09backing_dev: remove current->backing_dev_infoChristoph Hellwig
Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-24splice: Use filemap_splice_read() instead of generic_file_splice_read()David Howells
Replace pointers to generic_file_splice_read() with calls to filemap_splice_read(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30iov_iter: add iter_iovec() helperJens Axboe
This returns a pointer to the current iovec entry in the iterator. Only useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF and ITER_IOVEC identically for the first segment. Rename struct iov_iter->iov to iov_iter->__iov to find any potentially troublesome spots, and also to prevent anyone from adding new code that accesses iter->iov directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-27Merge tag 'fuse-update-6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix regression in fileattr permission checking - Fix possible hang during PID namespace destruction - Add generic support for request extensions - Add supplementary group list extension - Add limited support for supplying supplementary groups in create requests - Documentation fixes * tag 'fuse-update-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add inode/permission checks to fileattr_get/fileattr_set fuse: fix all W=1 kernel-doc warnings fuse: in fuse_flush only wait if someone wants the return code fuse: optional supplementary group in create requests fuse: add request extension
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-02fs: convert writepage_t callback to pass a folioMatthew Wilcox (Oracle)
Patch series "Convert writepage_t to use a folio". More folioisation. I split out the mpage work from everything else because it completely dominated the patch, but some implementations I just converted outright. This patch (of 2): We always write back an entire folio, but that's currently passed as the head page. Convert all filesystems that use write_cache_pages() to expect a folio instead of a page. Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-26fuse: fix all W=1 kernel-doc warningsRandy Dunlap
Use correct function name in kernel-doc notation. (1) Don't use "/**" to begin non-kernel-doc comments. (3) Fixes these warnings: fs/fuse/cuse.c:272: warning: expecting prototype for cuse_parse_dev_info(). Prototype was for cuse_parse_devinfo() instead fs/fuse/dev.c:212: warning: expecting prototype for A new request is available, wake fiq(). Prototype was for fuse_dev_wake_and_unlock() instead fs/fuse/dir.c:149: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Mark the attributes as stale due to an atime change. Avoid the invalidate if fs/fuse/file.c:656: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * In case of short read, the caller sets 'pos' to the position of Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26fuse: in fuse_flush only wait if someone wants the return codeEric W. Biederman
If a fuse filesystem is mounted inside a container, there is a problem during pid namespace destruction. The scenario is: 1. task (a thread in the fuse server, with a fuse file open) starts exiting, does exit_signals(), goes into fuse_flush() -> wait 2. fuse daemon gets killed, tries to wake everyone up 3. task from 1 is stuck because complete_signal() doesn't wake it up, since it has PF_EXITING. The result is that the thread will never be woken up, and pid namespace destruction will block indefinitely. To add insult to injury, nobody is waiting for these return codes, since the pid namespace is being destroyed. To fix this, let's not block on flush operations when the current task has PF_EXITING. This does change the semantics slightly: the wait here is for posix locks to be unlocked, so the task will exit before things are unlocked. To quote Miklos: "remote" posix locks are almost never used due to problems like this, so I think it's safe to do this. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/ Tested-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-19fs: port privilege checking helpers to mnt_idmapChristian Brauner
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-11filelock: move file locking definitions to separate header fileJeff Layton
The file locking definitions have lived in fs.h since the dawn of time, but they are only used by a small subset of the source files that include it. Move the file locking definitions to a new header file, and add the appropriate #include directives to the source files that need them. By doing this we trim down fs.h a bit and limit the amount of rebuilding that has to be done when we make changes to the file locking APIs. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Steve French <stfrench@microsoft.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-12-12Merge tag 'fuse-update-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: - Allow some write requests to proceed in parallel - Fix a performance problem with allow_sys_admin_access - Add a special kind of invalidation that doesn't immediately purge submounts - On revalidation treat the target of rename(RENAME_NOREPLACE) the same as open(O_EXCL) - Use type safe helpers for some mnt_userns transformations * tag 'fuse-update-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: Rearrange fuse_allow_current_process checks fuse: allow non-extending parallel direct writes on the same file fuse: remove the unneeded result variable fuse: port to vfs{g,u}id_t and associated helpers fuse: Remove user_ns check for FUSE_DEV_IOC_CLONE fuse: always revalidate rename target dentry fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY fs/fuse: Replace kmap() with kmap_local_page()
2022-12-12Merge tag 'fs.ovl.setgid.v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull setgid inheritance updates from Christian Brauner: "This contains the work to make setgid inheritance consistent between modifying a file and when changing ownership or mode as this has been a repeated source of very subtle bugs. The gist is that we perform the same permission checks in the write path as we do in the ownership and mode changing paths after this series where we're currently doing different things. We've already made setgid inheritance a lot more consistent and reliable in the last releases by moving setgid stripping from the individual filesystems up into the vfs. This aims to make the logic even more consistent and easier to understand and also to fix long-standing overlayfs setgid inheritance bugs. Miklos was nice enough to just let me carry the trivial overlayfs patches from Amir too. Below is a more detailed explanation how the current difference in setgid handling lead to very subtle bugs exemplified via overlayfs which is a victim of the current rules. I hope this explains why I think taking the regression risk here is worth it. A long while ago I found a few setgid inheritance bugs in overlayfs in the write path in certain conditions. Amir recently picked this back up in [1] and I jumped on board to fix this more generally. On the surface all that overlayfs would need to fix setgid inheritance would be to call file_remove_privs() or file_modified() but actually that isn't enough because the setgid inheritance api is wildly inconsistent in that area. Before this pr setgid stripping in file_remove_privs()'s old should_remove_suid() helper was inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() /* TAKE ON MOUNTER'S CREDS */ -> ovl_do_notify_change() -> notify_change() /* GIVE UP MOUNTER'S CREDS */ /* TAKE ON MOUNTER'S CREDS */ -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID Note that some xfstests will now fail as these patches will cause the setgid bit to be lost in certain conditions for unprivileged users modifying a setgid file when they would've been kept otherwise. I think this risk is worth taking and I explained and mentioned this multiple times on the list [2]. Enforcing the rules consistently across write operations and chmod/chown will lead to losing the setgid bit in cases were it might've been retained before. While I've mentioned this a few times but it's worth repeating just to make sure that this is understood. For the sake of maintainability, consistency, and security this is a risk worth taking. If we really see regressions for workloads the fix is to have special setgid handling in the write path again with different semantics from chmod/chown and possibly additional duct tape for overlayfs. I'll update the relevant xfstests with if you should decide to merge this second setgid cleanup. Before that people should be aware that there might be failures for fstests where unprivileged users modify a setgid file" Link: https://lore.kernel.org/linux-fsdevel/20221003123040.900827-1-amir73il@gmail.com [1] Link: https://lore.kernel.org/linux-fsdevel/20221122142010.zchf2jz2oymx55qi@wittgenstein [2] * tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: use consistent setgid checks in is_sxid() ovl: remove privs in ovl_fallocate() ovl: remove privs in ovl_copyfile() attr: use consistent sgid stripping checks attr: add setattr_should_drop_sgid() fs: move should_remove_suid() attr: add in_group_or_capable()
2022-11-23fuse: allow non-extending parallel direct writes on the same fileDharmendra Singh
In general, as of now, in FUSE, direct writes on the same file are serialized over inode lock i.e we hold inode lock for the full duration of the write request. I could not find in fuse code and git history a comment which clearly explains why this exclusive lock is taken for direct writes. Following might be the reasons for acquiring an exclusive lock but not be limited to 1) Our guess is some USER space fuse implementations might be relying on this lock for serialization. 2) The lock protects against file read/write size races. 3) Ruling out any issues arising from partial write failures. This patch relaxes the exclusive lock for direct non-extending writes only. File size extending writes might not need the lock either, but we are not entirely sure if there is a risk to introduce any kind of regression. Furthermore, benchmarking with fio does not show a difference between patch versions that take on file size extension a) an exclusive lock and b) a shared lock. A possible example of an issue with i_size extending writes are write error cases. Some writes might succeed and others might fail for file system internal reasons - for example ENOSPACE. With parallel file size extending writes it _might_ be difficult to revert the action of the failing write, especially to restore the right i_size. With these changes, we allow non-extending parallel direct writes on the same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is passed from libfuse to fuse kernel as part of file open/create), we do not take exclusive lock anymore, but instead use a shared lock that allows non-extending writes to run in parallel. FUSE implementations which rely on this inode lock for serialization can continue to do so and serialized direct writes are still the default. Implementations that do not do write serialization need to be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file open/create reply. On patch review there were concerns that network file systems (or vfs multiple mounts of the same file system) might have issues with parallel writes. We believe this is not the case, as this is just a local lock, which network file systems could not rely on anyway. I.e. this lock is just for local consistency. Signed-off-by: Dharmendra Singh <dsingh@ddn.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>