summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-11-18Merge tag 'vfs-6.13.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs updates from Christian Brauner: "Various fixes for the netfs library and related infrastructure: cachefiles: - Fix a dentry leak in cachefiles_open_file() - Fix incorrect length return value in cachefiles_ondemand_fd_write_iter() - Fix missing pos updates in cachefiles_ondemand_fd_write_iter() - Clean up in cachefiles_commit_tmpfile() - Fix NULL pointer dereference in object->file - Add a memory barrier for FSCACHE_VOLUME_CREATING netfs: - Remove call to folio_index() - Fix a few minor bugs in netfs_page_mkwrite() - Remove unnecessary references to pages" * tag 'vfs-6.13.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs/fscache: Add a memory barrier for FSCACHE_VOLUME_CREATING cachefiles: Fix NULL pointer dereference in object->file cachefiles: Clean up in cachefiles_commit_tmpfile() cachefiles: Fix missing pos updates in cachefiles_ondemand_fd_write_iter() cachefiles: Fix incorrect length return value in cachefiles_ondemand_fd_write_iter() netfs: Remove unnecessary references to pages netfs: Fix a few minor bugs in netfs_page_mkwrite() netfs: Remove call to folio_index()
2024-11-18Merge tag 'vfs-6.13.pagecache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagecache updates from Christian Brauner: "Cleanup filesystem page flag usage: This continues the work to make the mappedtodisk/owner_2 flag available to filesystems which don't use buffer heads. Further patches remove uses of Private2. This brings us very close to being rid of it entirely" * tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: migrate: Remove references to Private2 ceph: Remove call to PagePrivate2() btrfs: Switch from using the private_2 flag to owner_2 mm: Remove PageMappedToDisk nilfs2: Convert nilfs_copy_buffer() to use folios fs: Move clearing of mappedtodisk to buffer.c
2024-11-18Merge tag 'vfs-6.13.rust.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs rust file abstractions from Christian Brauner: "This contains the file abstractions needed by the Rust implementation of the Binder driver and other parts of the kernel. Let's treat this as a first attempt at getting something working but I do expect the actual interfaces to change significantly over time. Simply because we are still figuring out what actually works. But there's no point in further theorizing. Let's see how it holds up with actual users" * tag 'vfs-6.13.rust.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: rust: task: adjust safety comments in Task methods rust: add seqfile abstraction rust: file: add abstraction for `poll_table` rust: file: add `Kuid` wrapper rust: file: add `FileDescriptorReservation` rust: security: add abstraction for secctx rust: cred: add Rust abstraction for `struct cred` rust: file: add Rust abstraction for `struct file` rust: task: add `Task::current_raw` rust: types: add `NotThreadSafe`
2024-11-18Merge tag 'vfs-6.13.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...
2024-11-18Merge tag 'vfs-6.13.mount.api' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount api conversions from Christian Brauner: "Convert adfs, affs, befs, hfs, hfsplus, jfs, and hpfs to the new mount api" * tag 'vfs-6.13.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: efs: fix the efs new mount api implementation ubifs: Convert ubifs to use the new mount API hpfs: convert hpfs to use the new mount api jfs: convert jfs to use the new mount api hfsplus: convert hfsplus to use the new mount api hfs: convert hfs to use the new mount api befs: convert befs to use the new mount api affs: convert affs to use the new mount api adfs: convert adfs to use the new mount api
2024-11-18Merge tag 'vfs-6.13.mgtime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs multigrain timestamps from Christian Brauner: "This is another try at implementing multigrain timestamps. This time with significant help from the timekeeping maintainers to reduce the performance impact. Thomas provided a base branch that contains the required timekeeping interfaces for the VFS. It serves as the base for the multi-grain timestamp work: - Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes is being actively observed via ->getattr(). With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees. To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead. The timekeeper changes add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline. Two new public timekeeper interfaces are added: (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time (2) ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. - The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This adds a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. This is where the earlier mentioned timkeeping interfaces help. A global monotonic atomic64_t value is kept that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems)" * tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce pointer chasing in is_mgtime() test tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime timekeeping: Add percpu counter for tracking floor swap events timekeeping: Add interfaces for handling timestamps with a floor value fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps
2024-11-18ceph: miscellaneous spelling fixesDmitry Antipov
Correct spelling here and there as suggested by codespell. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18ceph: Use strscpy() instead of strcpy() in __get_snap_name()Abdul Rahim
strcpy() performs no bounds checking on the destination buffer. This could result in linear overflows beyond the end of the buffer, leading to all kinds of misbehaviors [1]. This fixes checkpatch warning: WARNING: Prefer strscpy over strcpy [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [ idryomov: formatting ] Signed-off-by: Abdul Rahim <abdul.rahim@myyahoo.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18ceph: Use str_true_false() helper in status_show()Thorsten Blum
Remove hard-coded strings by using the str_true_false() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18ceph: requalify some char pointers as constPatrick Donnelly
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18ceph: extract entity name from device idPatrick Donnelly
Previously, the "name" in the new device syntax "<name>@<fsid>.<fsname>" was ignored because (presumably) tests were done using mount.ceph which also passed the entity name using "-o name=foo". If mounting is done without the mount.ceph helper, the new device id syntax fails to set the name properly. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/68516 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18ceph: Remove fs/ceph deadcodeDr. David Alan Gilbert
ceph_caps_revoking() has been unused since 2017's commit 3fb99d483e61 ("ceph: nuke startsync op") ceph_mdsc_open_export_target_sessions() has been unused since 2013's commit 11df2dfb610d ("ceph: add imported caps when handling cap export message") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18fsnotify: Fix ordering of iput() and watched_objects decrementJann Horn
Ensure the superblock is kept alive until we're done with iput(). Holding a reference to an inode is not allowed unless we ensure the superblock stays alive, which fsnotify does by keeping the watched_objects count elevated, so iput() must happen before the watched_objects decrement. This can lead to a UAF of something like sb->s_fs_info in tmpfs, but the UAF is hard to hit because race orderings that oops are more likely, thanks to the CHECK_DATA_CORRUPTION() block in generic_shutdown_super(). Also, ensure that fsnotify_put_sb_watched_objects() doesn't call fsnotify_sb_watched_objects() on a superblock that may have already been freed, which would cause a UAF read of sb->s_fsnotify_info. Cc: stable@kernel.org Fixes: d2f277e26f52 ("fsnotify: rename fsnotify_{get,put}_sb_connectors()") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Kara <jack@suse.cz>
2024-11-18dlm: fix dlm_recover_members refcount on errorAlexander Aring
If dlm_recover_members() fails we don't drop the references of the previous created root_list that holds and keep all rsbs alive during the recovery. It might be not an unlikely event because ping_members() could run into an -EINTR if another recovery progress was triggered again. Fixes: 3a747f4a2ee8 ("dlm: move rsb root_list to ls_recover() stack") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-11-18Revert "nfs: don't reuse partially completed requests in ↵Trond Myklebust
nfs_lock_and_join_requests" This reverts commit b571cfcb9dcac187c6d967987792d37cb0688610. This patch appears to assume that if one request is complete, then the others will complete too before unlocking. That is not a valid assumption, since other requests could hit a non-fatal error or a short write that would cause them not to complete. Reported-by: Igor Raits <igor@gooddata.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219508 Fixes: b571cfcb9dca ("nfs: don't reuse partially completed requests in nfs_lock_and_join_requests") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-11-18virtiofs: dax: remove ->writepages() callbackAsahi Lina
When using FUSE DAX with virtiofs, cache coherency is managed by the host. Disk persistence is handled via fsync() and friends, which are passed directly via the FUSE layer to the host. Therefore, there's no need to do dax_writeback_mapping_range(). All that ends up doing is a cache flush operation, which is not caught by KVM and doesn't do much, since the host and guest are already cache-coherent. Since dax_writeback_mapping_range() checks that the inode block size is equal to PAGE_SIZE, this fixes a spurious WARN when virtiofs is used with a mismatched guest PAGE_SIZE and virtiofs backing FS block size (this happens, for example, when it's a tmpfs and the host and guest have a different PAGE_SIZE). FUSE DAX does not require any particular FS block size, since it always performs DAX mappings in aligned 2MiB blocks. See discussion in [1]. [1] https://lore.kernel.org/lkml/20241101-dax-page-size-v1-1-eedbd0c6b08f@asahilina.net/T/#u [SzM: remove the empty callback] Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Asahi Lina <lina@asahilina.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-11-18fuse: check attributes staleness on fuse_iget()Zhang Tianci
Function fuse_direntplus_link() might call fuse_iget() to initialize a new fuse_inode and change its attributes. If fi->attr_version is always initialized with 0, even if the attributes returned by the FUSE_READDIR request is staled, as the new fi->attr_version is 0, fuse_change_attributes will still set the staled attributes to inode. This wrong behaviour may cause file size inconsistency even when there is no changes from server-side. To reproduce the issue, consider the following 2 programs (A and B) are running concurrently, A B ---------------------------------- -------------------------------- { /fusemnt/dir/f is a file path in a fuse mount, the size of f is 0. } readdir(/fusemnt/dir) start //Daemon set size 0 to f direntry fallocate(f, 1024) stat(f) // B see size 1024 echo 2 > /proc/sys/vm/drop_caches readdir(/fusemnt/dir) reply to kernel Kernel set 0 to the I_NEW inode stat(f) // B see size 0 In the above case, only program B is modifying the file size, however, B observes file size changing between the 2 'readonly' stat() calls. To fix this issue, we should make sure readdirplus still follows the rule of attr_version staleness checking even if the fi->attr_version is lost due to inode eviction. To identify this situation, the new fc->evict_ctr is used to record whether the eviction of inodes occurs during the readdirplus request processing. If it does, the result of readdirplus may be inaccurate; otherwise, the result of readdirplus can be trusted. Although this may still lead to incorrect invalidation, considering the relatively low frequency of evict occurrences, it should be acceptable. Link: https://lore.kernel.org/lkml/20230711043405.66256-2-zhangjiachen.jaycee@bytedance.com/ Link: https://lore.kernel.org/lkml/20241114070905.48901-1-zhangtianci.1997@bytedance.com/ Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-11-18erofs: handle NONHEAD !delta[1] lclusters gracefullyGao Xiang
syzbot reported a WARNING in iomap_iter_done: iomap_fiemap+0x73b/0x9b0 fs/iomap/fiemap.c:80 ioctl_fiemap fs/ioctl.c:220 [inline] Generally, NONHEAD lclusters won't have delta[1]==0, except for crafted images and filesystems created by pre-1.0 mkfs versions. Previously, it would immediately bail out if delta[1]==0, which led to inadequate decompressed lengths (thus FIEMAP is impacted). Treat it as delta[1]=1 to work around these legacy mkfs versions. `lclusterbits > 14` is illegal for compact indexes, error out too. Reported-by: syzbot+6c0b301317aa0156f9eb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/67373c0c.050a0220.2a2fcc.0079.GAE@google.com Tested-by: syzbot+6c0b301317aa0156f9eb@syzkaller.appspotmail.com Fixes: d95ae5e25326 ("erofs: add support for the full decompressed length") Fixes: 001b8ccd0650 ("erofs: fix compact 4B support for 16k block size") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115173651.3339514-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: clarify direct I/O supportGao Xiang
Currently, only filesystems backed by block devices support direct I/O. Also remove the unnecessary strict checks that can be supported with iomap. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115074625.2520728-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix blksize < PAGE_SIZE for file-backed mountsHongzhen Luo
Adjust sb->s_blocksize{,_bits} directly for file-backed mounts when the fs block size is smaller than PAGE_SIZE. Previously, EROFS used sb_set_blocksize(), which caused a panic if bdev-backed mounts is not used. Fixes: fb176750266a ("erofs: add file-backed mount support") Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20241015103836.3757438-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: get rid of `buf->kmap_type`Gao Xiang
After commit 927e5010ff5b ("erofs: use kmap_local_page() only for erofs_bread()"), `buf->kmap_type` actually has no use at all. Let's get rid of `buf->kmap_type` now. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114095813.839866-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix file-backed mounts over FUSEGao Xiang
syzbot reported a null-ptr-deref in fuse_read_args_fill: fuse_read_folio+0xb0/0x100 fs/fuse/file.c:905 filemap_read_folio+0xc6/0x2a0 mm/filemap.c:2367 do_read_cache_folio+0x263/0x5c0 mm/filemap.c:3825 read_mapping_folio include/linux/pagemap.h:1011 [inline] erofs_bread+0x34d/0x7e0 fs/erofs/data.c:41 erofs_read_superblock fs/erofs/super.c:281 [inline] erofs_fc_fill_super+0x2b9/0x2500 fs/erofs/super.c:625 Unlike most filesystems, some network filesystems and FUSE need unavoidable valid `file` pointers for their read I/Os [1]. Anyway, those use cases need to be supported too. [1] https://docs.kernel.org/filesystems/vfs.html Reported-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/6727bbdf.050a0220.3c8d68.0a7e.GAE@google.com Fixes: fb176750266a ("erofs: add file-backed mount support") Tested-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114234905.1873723-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: simplify definition of the log functionsGou Hao
Use printk instead of pr_info/err to reduce redundant code. Signed-off-by: Gou Hao <gouhao@uniontech.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114013247.30821-1-gouhao@uniontech.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: add sysfs node to drop internal cachesChunhai Guo
Add a sysfs node to drop compression-related caches, currently used to drop in-memory pclusters and cached compressed folios. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: free pclusters if no cached folio is attachedChunhai Guo
Once a pcluster is fully decompressed and there are no attached cached folios, its corresponding `struct z_erofs_pcluster` will be freed. This will significantly reduce the frequency of calls to erofs_shrink_scan() and the memory allocated for `struct z_erofs_pcluster`. The tables below show approximately a 96% reduction in the calls to erofs_shrink_scan() and in the memory allocated for `struct z_erofs_pcluster` after applying this patch. The results were obtained by performing a test to copy a 4.1GB partition on ARM64 Android devices running the 6.6 kernel with an 8-core CPU and 12GB of memory. 1. The reduction in calls to erofs_shrink_scan(): +-----------------+-----------+----------+---------+ | | w/o patch | w/ patch | diff | +-----------------+-----------+----------+---------+ | Average (times) | 11390 | 390 | -96.57% | +-----------------+-----------+----------+---------+ 2. The reduction in memory released by erofs_shrink_scan(): +-----------------+-----------+----------+---------+ | | w/o patch | w/ patch | diff | +-----------------+-----------+----------+---------+ | Average (Byte) | 133612656 | 4434552 | -96.68% | +-----------------+-----------+----------+---------+ Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241112043235.546164-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: sunset `struct erofs_workgroup`Gao Xiang
`struct erofs_workgroup` was introduced to provide a unique header for all physically indexed objects. However, after big pclusters and shared pclusters are implemented upstream, it seems that all EROFS encoded data (which requires transformation) can be represented with `struct z_erofs_pcluster` directly. Move all members into `struct z_erofs_pcluster` for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com
2024-11-18erofs: move erofs_workgroup operations into zdata.cGao Xiang
Move related helpers into zdata.c as an intermediate step of getting rid of `struct erofs_workgroup`, and rename: erofs_workgroup_put => z_erofs_put_pcluster erofs_workgroup_get => z_erofs_get_pcluster erofs_try_to_release_workgroup => erofs_try_to_release_pcluster erofs_shrink_workstation => z_erofs_shrink_scan Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-2-hsiangkao@linux.alibaba.com
2024-11-18erofs: get rid of erofs_{find,insert}_workgroupGao Xiang
Just fold them into the only two callers since they are simple enough. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-1-hsiangkao@linux.alibaba.com
2024-11-17smb: client: fix use-after-free of signing keyPaulo Alcantara
Customers have reported use-after-free in @ses->auth_key.response with SMB2.1 + sign mounts which occurs due to following race: task A task B cifs_mount() dfs_mount_share() get_session() cifs_mount_get_session() cifs_send_recv() cifs_get_smb_ses() compound_send_recv() cifs_setup_session() smb2_setup_request() kfree_sensitive() smb2_calc_signature() crypto_shash_setkey() *UAF* Fix this by ensuring that we have a valid @ses->auth_key.response by checking whether @ses->ses_status is SES_GOOD or SES_EXITING with @ses->ses_lock held. After commit 24a9799aa8ef ("smb: client: fix UAF in smb2_reconnect_server()"), we made sure to call ->logoff() only when @ses was known to be good (e.g. valid ->auth_key.response), so it's safe to access signing key when @ses->ses_status == SES_EXITING. Cc: stable@vger.kernel.org Reported-by: Jay Shin <jaeshin@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17smb: client: Use str_yes_no() helper functionThorsten Blum
Remove hard-coded strings by using the str_yes_no() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17smb: client: memcpy() with surrounding object base addressKees Cook
Like commit f1f047bd7ce0 ("smb: client: Fix -Wstringop-overflow issues"), adjust the memcpy() destination address to be based off the surrounding object rather than based off the 4-byte "Protocol" member. This avoids a build-time warning when compiling under CONFIG_FORTIFY_SOURCE with GCC 15: In function 'fortify_memcpy_chk', inlined from 'CIFSSMBSetPathInfo' at ../fs/smb/client/cifssmb.c:5358:2: ../include/linux/fortify-string.h:571:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] 571 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17cifs: Remove pre-historic unused CIFSSMBCopyDr. David Alan Gilbert
CIFSSMBCopy() is unused, remove it. It seems to have been that way pre-git; looking in a historic archive, I think it landed around May 2004 in Linus' BKrev: 40ab7591J_OgkpHW-qhzZukvAUAw9g and was unused back then. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17ksmbd: fix malformed unsupported smb1 negotiate responseNamjae Jeon
When mounting with vers=1.0, ksmbd should return unsupported smb1 negotiate response. But this response is malformed. [ 6010.586702] CIFS: VFS: Bad protocol string signature header 0x25000000 [ 6010.586708] 00000000: 25000000 25000000 424d53ff 00000072 ...%...%.SMBr... [ 6010.586711] 00000010: c8408000 00000000 00000000 00000000 ..@............. [ 6010.586713] 00000020: 00 00 b9 32 00 00 01 00 01 ...2..... Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-16fs/9p: replace functions v9fs_cache_{register|unregister} with direct callsColin Ian King
The helper functions v9fs_cache_register and v9fs_cache_unregister are trivial helper functions that don't offer any extra functionality and are unncessary. Replace them with direct calls to v9fs_init_inode_cache and v9fs_destroy_inode_cache respectively to simplify the code. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Message-ID: <20241107095756.10261-1-colin.i.king@gmail.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-11-15hostfs: Fix the NULL vs IS_ERR() bug for __filemap_get_folio()ZhangPeng
The __filemap_get_folio() function returns error pointers. It never returns NULL. So use IS_ERR() to check it. Fixes: 1da86618bdce ("fs: Convert aops->write_begin to take a folio") Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-11-15dlm: fix recovery of middle conversionsAlexander Aring
In one special case, recovery is unable to reliably rebuild lock state by simply recreating lkb structs as sent from the lock holders. That case is when the lkb's include conversions between PR and CW modes. The recovery code has always recognized this special case, but the implemention has always been broken, and would set invalid modes in recovered lkb's. Unpredictable or bogus errors could then be returned for further locking calls on these locks. This bug has gone unnoticed for so long due to some combination of: - applications never or infrequently converting between PR/CW - recovery not occuring during these conversions - if the recovery bug does occur, the caller may not notice, depending on what further locking calls are made, e.g. if the lock is simply unlocked it may go unnoticed However, a core analysis from a recent gfs2 bug report points to this broken code. PR = Protected Read CW = Concurrent Write PR and CW are incompatible PR and PR are compatible CW and CW are compatible Example 1 node C, resource R granted: PR node A granted: PR node B granted: NL node C granted: NL node D - A sends convert PR->CW to C - C fails before A gets a reply - recovery occurs At this point, A does not know if it still holds the lock in PR, or if its conversion to CW was granted: - If A's conversion to CW was granted, then another node's CW lock may also have been granted. - If A's conversion to CW was not granted, it still holds a PR lock, and other nodes may also hold PR locks. So, the new master of R cannot simply recreate the lock from A using granted mode PR and requested mode CW. The new master must look at all the recovered locks to determine the correct granted modes, and ensure that all the recovered locks are recreated in compatible states. The correct lock recovery steps in this example are: - node D becomes the new master of R - node B sends D its lkb, granted PR - node A sends D its lkb, convert PR->CW - D determines the correct lock state is: granted: PR node B convert: PR->CW node A The lkb sent by each node was recreated without any change on the new master node. Example 2 node C, resource R granted: PR node A granted: NL node C granted: NL node D waiting: CW node B - A sends convert PR->CW to C - C grants the conversion to CW for A - C grants the waiting request for CW to B - C sends granted message to B, but fails before it can send the granted message to A - B receives the granted message from C At this point: - A believes it is converting PR->CW - B believes it is holding a CW lock The correct lock recovery steps in this example are: - node D becomes the new master of R - node A sends D its lkb, convert PR->CW - node B sends D its lkb, granted CW - D determins the correct lock state is: granted: CW node B granted: CW node A The lkb sent by B is recreated without change, but the lkb sent by A is changed because the granted mode was not compatible. Fixes to make this work correctly: recover_convert_waiter: should not make any changes to a converting lkb that is still waiting for a reply message. It was previously setting grmode to IV, which is invalid state, so the lkb would not be handled correctly by other code. receive_rcom_lock_args: was checking the wrong lkb field (wait_type instead of status) to determine if the lkb is being converted, and in need of inspection for this special recovery. It was also setting grmode to IV in the lkb, causing it to be mishandled by other code. Now, this function just puts the lkb, directly as sent, onto the convert queue of the resource being recovered, and corrects it in recover_conversion() later, if needed. recover_conversion: the job of this function is to detect and correct lkb states for the special PR/CW conversions. The new code now checks for recovered lkbs on the granted queue with grmode PR or CW, and takes the real grmode from that. Then it looks for lkbs on the convert queue with an incompatible grmode (i.e. grmode PR when the real grmode is CW, or v.v.) These converting lkbs need to be fixed. They are fixed by temporarily setting their grmode to NL, so that grmodes are not incompatible and won't confuse other locking code. The converting lkb will then be granted at the end of recovery, replacing the temporary NL grmode. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-11-15Merge tag 'for-6.12-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix that seems urgent and good to have in 6.12 final. It could potentially lead to unexpected transaction aborts, due to wrong comparison and order of processing of delayed refs" * tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix incorrect comparison for delayed refs
2024-11-15ubifs: Fix uninitialized use of err in ubifs_jnl_write_inode()Nathan Chancellor
Clang warns (or errors with CONFIG_WERROR=y): fs/ubifs/journal.c:986:20: error: variable 'err' is uninitialized when used here [-Werror,-Wuninitialized] 986 | ubifs_ro_mode(c, err); | ^~~ Set err to -EPERM before the call to ubifs_ro_mode() and reuse it in the return statement to resolve the warning. Fixes: 957e1c4e1779 ("ubifs: ubifs_jnl_write_inode: Only check once for the limitation of xattr count") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-11-15ecryptfs: Fix spelling mistake "validationg" -> "validating"Colin Ian King
There is a spelling mistake in an error message literal string. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20241108112509.109891-1-colin.i.king@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-15ecryptfs: Convert ecryptfs to use the new mount APIEric Sandeen
Convert ecryptfs to the new mount API. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/20241028143359.605061-3-sandeen@redhat.com Acked-by: Tyler Hicks <code@tyhicks.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-15ecryptfs: Factor out mount option validationEric Sandeen
Under the new mount API, mount options are parsed one at a time. Any validation that examines multiple options must be done after parsing is complete, so factor out a ecryptfs_validate_options() which can be called separately. To facilitate this, temporarily move the local variables that tracked whether various options have been set in the parsing function, into the ecryptfs_mount_crypt_stat structure so that they can be examined later. These will be moved to a more ephemeral struct in the mount api conversion patch to follow. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/20241028143359.605061-2-sandeen@redhat.com Acked-by: Tyler Hicks <code@tyhicks.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-15fs: open_by_handle_at() support for decoding "explicit connectable" file handlesAmir Goldstein
Teach open_by_handle_at(2) about the type format of "explicit connectable" file handles that were created using the AT_HANDLE_CONNECTABLE flag to name_to_handle_at(2). When decoding an "explicit connectable" file handles, name_to_handle_at(2) should fail if it cannot open a "connected" fd with known path, which is accessible (to capable user) from mount fd path. Note that this does not check if the path is accessible to the calling user, just that it is accessible wrt the mount namesapce, so if there is no "connected" alias, or if parts of the path are hidden in the mount namespace, open_by_handle_at(2) will return -ESTALE. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20241011090023.655623-4-amir73il@gmail.com Fixes: 570df4e9c23f ("ceph: snapshot nfs re-export") Acked-by: Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-15fs: name_to_handle_at() support for "explicit connectable" file handlesAmir Goldstein
nfsd encodes "connectable" file handles for the subtree_check feature, which can be resolved to an open file with a connected path. So far, userspace nfs server could not make use of this functionality. Introduce a new flag AT_HANDLE_CONNECTABLE to name_to_handle_at(2). When used, the encoded file handle is "explicitly connectable". The "explicitly connectable" file handle sets bits in the high 16bit of the handle_type field, so open_by_handle_at(2) will know that it needs to open a file with a connected path. old kernels will now recognize the handle_type with high bits set, so "explicitly connectable" file handles cannot be decoded by open_by_handle_at(2) on old kernels. The flag AT_HANDLE_CONNECTABLE is not allowed together with either AT_HANDLE_FID or AT_EMPTY_PATH. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20241011090023.655623-3-amir73il@gmail.com Fixes: 570df4e9c23f ("ceph: snapshot nfs re-export") Acked-by: Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-15fs: prepare for "explicit connectable" file handlesAmir Goldstein
We would like to use the high 16bit of the handle_type field to encode file handle traits, such as "connectable". In preparation for this change, make sure that filesystems do not return a handle_type value with upper bits set and that the open_by_handle_at(2) syscall rejects these handle types. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20241011090023.655623-2-amir73il@gmail.com Fixes: 570df4e9c23f ("ceph: snapshot nfs re-export") Acked-by: Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-15ovl: convert ovl_real_fdget() callers to ovl_real_file()Amir Goldstein
Stop using struct fd to return a real file from ovl_real_fdget(), because we no longer return a temporary file object and the callers always get a borrowed file reference. Rename the helper to ovl_real_file(), return a borrowed reference of the real file that is referenced from the overlayfs file or an error. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: convert ovl_real_fdget_path() callers to ovl_real_file_path()Amir Goldstein
Stop using struct fd to return a real file from ovl_real_fdget_path(), because we no longer return a temporary file object and the callers always get a borrowed file reference. Rename the helper to ovl_real_file_path(), return a borrowed reference of the real file that is referenced from the overlayfs file or an error. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: store upper real file in ovl_file structAmir Goldstein
When an overlayfs file is opened as lower and then the file is copied up, every operation on the overlayfs open file will open a temporary backing file to the upper dentry and close it at the end of the operation. Store the upper real file along side the original (lower) real file in ovl_file instead of opening a temporary upper file on every operation. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: allocate a container struct ovl_file for ovl private contextAmir Goldstein
Instead of using ->private_data to point at realfile directly, so that we can add more context per ovl open file. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: do not open non-data lower file for fsyncAmir Goldstein
ovl_fsync() with !datasync opens a backing file from the top most dentry in the stack, checks if this dentry is non-upper and skips the fsync. In case of an overlay dentry stack with lower data and lower metadata above it, but without an upper metadata above it, the backing file is opened from the top most lower metadata dentry and never used. Refactor the helper ovl_real_fdget_meta() into ovl_real_fdget_path() and open code the checks for non-upper inode in ovl_fsync(), so in that case we can avoid the unneeded backing file open. Signed-off-by: Amir Goldstein <amir73il@gmail.com>