path: root/fs
Age  Commit message  Author
2024-11-18Merge tag 'vfs-6.13.mgtime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs multigrain timestamps from Christian Brauner:

"This is another try at implementing multigrain timestamps. This time with significant help from the timekeeping maintainers to reduce the performance impact.

Thomas provided a base branch that contains the required timekeeping interfaces for the VFS. It serves as the base for the multi-grain timestamp work:

 - Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes are being actively observed via ->getattr().

   With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees.

   To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead.

   The timekeeper changes add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline.

   Two new public timekeeper interfaces are added:

   (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time

   (2) ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result.
 - The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes.

   Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g. backup applications).

   If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates.

   This adds a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value.

   This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees.

   This is where the earlier mentioned timekeeping interfaces help. A global monotonic atomic64_t value is kept that acts as a timestamp floor. When we go to stamp a file, we first get the later of the current floor value and the current coarse-grained time.
   If the inode ctime hasn't been queried then we just attempt to stamp it with that value.

   If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime.

   We take the result of the ctime swap whether it succeeds or fails, since either is just as valid.

   Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems)"

* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reduce pointer chasing in is_mgtime() test
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  timekeeping: Add percpu counter for tracking floor swap events
  timekeeping: Add interfaces for handling timestamps with a floor value
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps
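The floor mechanism described in this pull request can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: times are plain nanosecond counts passed in by the caller, the names `toy_floor`, `toy_coarse_mg()`, `toy_fine_mg()`, `struct toy_inode` and `toy_stamp()` are hypothetical, the inode update is not atomic here (the message describes a swap into the ctime), and clearing the "queried" flag on a new stamp is a modeling assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: all times are nanosecond counts on one clock (assumption). */
static _Atomic int64_t toy_floor;  /* latest fine-grained time handed out */

/* Coarse read: the later of the coarse clock and the floor. */
static int64_t toy_coarse_mg(int64_t coarse_now)
{
    int64_t floor = atomic_load(&toy_floor);
    return coarse_now > floor ? coarse_now : floor;
}

/* Fine read: try to swap the fine-grained time into the floor. */
static int64_t toy_fine_mg(int64_t fine_now)
{
    int64_t old = atomic_load(&toy_floor);

    if (fine_now <= old)
        return old;  /* modeling guard: never move the floor backwards */
    if (atomic_compare_exchange_strong(&toy_floor, &old, fine_now))
        return fine_now;
    return old;  /* lost the race: 'old' now holds the current floor */
}

/* Hypothetical inode: a ctime plus the "has been queried" flag. */
struct toy_inode {
    int64_t ctime;
    bool queried;
};

/* Stamp an inode following the steps in the message above. */
static int64_t toy_stamp(struct toy_inode *ino, int64_t coarse_now,
                         int64_t fine_now)
{
    assert(ino != NULL);
    int64_t now = toy_coarse_mg(coarse_now);

    /* Only pay for a fine-grained time if the old stamp was observed
     * and the coarse time would not produce a visibly newer ctime. */
    if (ino->queried && now <= ino->ctime)
        now = toy_fine_mg(fine_now);

    ino->ctime = now;
    ino->queried = false;  /* assumption: a new stamp clears the flag */
    return now;
}
```

In this model, stamping a queried inode whose ctime equals the coarse time yields the fine-grained value, and a later coarse stamp on another inode is lifted to the floor, so the second file never appears older than the first.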
2024-11-18fsnotify: Fix ordering of iput() and watched_objects decrementJann Horn
Ensure the superblock is kept alive until we're done with iput(). Holding a reference to an inode is not allowed unless we ensure the superblock stays alive, which fsnotify does by keeping the watched_objects count elevated, so iput() must happen before the watched_objects decrement. This can lead to a UAF of something like sb->s_fs_info in tmpfs, but the UAF is hard to hit because race orderings that oops are more likely, thanks to the CHECK_DATA_CORRUPTION() block in generic_shutdown_super(). Also, ensure that fsnotify_put_sb_watched_objects() doesn't call fsnotify_sb_watched_objects() on a superblock that may have already been freed, which would cause a UAF read of sb->s_fsnotify_info. Cc: stable@kernel.org Fixes: d2f277e26f52 ("fsnotify: rename fsnotify_{get,put}_sb_connectors()") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Kara <jack@suse.cz>
2024-11-18dlm: fix dlm_recover_members refcount on errorAlexander Aring
If dlm_recover_members() fails we don't drop the references of the previously created root_list that holds and keeps all rsbs alive during the recovery. This may not be an unlikely event, because ping_members() could run into an -EINTR if another recovery process was triggered again. Fixes: 3a747f4a2ee8 ("dlm: move rsb root_list to ls_recover() stack") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-11-18erofs: handle NONHEAD !delta[1] lclusters gracefullyGao Xiang
syzbot reported a WARNING in iomap_iter_done: iomap_fiemap+0x73b/0x9b0 fs/iomap/fiemap.c:80 ioctl_fiemap fs/ioctl.c:220 [inline] Generally, NONHEAD lclusters won't have delta[1]==0, except for crafted images and filesystems created by pre-1.0 mkfs versions. Previously, it would immediately bail out if delta[1]==0, which led to inadequate decompressed lengths (thus FIEMAP is impacted). Treat it as delta[1]=1 to work around these legacy mkfs versions. `lclusterbits > 14` is illegal for compact indexes, error out too. Reported-by: syzbot+6c0b301317aa0156f9eb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/67373c0c.050a0220.2a2fcc.0079.GAE@google.com Tested-by: syzbot+6c0b301317aa0156f9eb@syzkaller.appspotmail.com Fixes: d95ae5e25326 ("erofs: add support for the full decompressed length") Fixes: 001b8ccd0650 ("erofs: fix compact 4B support for 16k block size") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115173651.3339514-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: clarify direct I/O supportGao Xiang
Currently, only filesystems backed by block devices support direct I/O. Also remove the unnecessary strict checks that can be supported with iomap. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115074625.2520728-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix blksize < PAGE_SIZE for file-backed mountsHongzhen Luo
Adjust sb->s_blocksize{,_bits} directly for file-backed mounts when the fs block size is smaller than PAGE_SIZE. Previously, EROFS used sb_set_blocksize(), which caused a panic if a bdev-backed mount is not used. Fixes: fb176750266a ("erofs: add file-backed mount support") Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20241015103836.3757438-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: get rid of `buf->kmap_type`Gao Xiang
After commit 927e5010ff5b ("erofs: use kmap_local_page() only for erofs_bread()"), `buf->kmap_type` actually has no use at all. Let's get rid of `buf->kmap_type` now. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114095813.839866-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix file-backed mounts over FUSEGao Xiang
syzbot reported a null-ptr-deref in fuse_read_args_fill: fuse_read_folio+0xb0/0x100 fs/fuse/file.c:905 filemap_read_folio+0xc6/0x2a0 mm/filemap.c:2367 do_read_cache_folio+0x263/0x5c0 mm/filemap.c:3825 read_mapping_folio include/linux/pagemap.h:1011 [inline] erofs_bread+0x34d/0x7e0 fs/erofs/data.c:41 erofs_read_superblock fs/erofs/super.c:281 [inline] erofs_fc_fill_super+0x2b9/0x2500 fs/erofs/super.c:625 Unlike most filesystems, some network filesystems and FUSE unavoidably need valid `file` pointers for their read I/Os [1]. Those use cases need to be supported too. [1] https://docs.kernel.org/filesystems/vfs.html Reported-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/6727bbdf.050a0220.3c8d68.0a7e.GAE@google.com Fixes: fb176750266a ("erofs: add file-backed mount support") Tested-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114234905.1873723-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: simplify definition of the log functionsGou Hao
Use printk instead of pr_info/err to reduce redundant code. Signed-off-by: Gou Hao <gouhao@uniontech.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114013247.30821-1-gouhao@uniontech.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: add sysfs node to drop internal cachesChunhai Guo
Add a sysfs node to drop compression-related caches, currently used to drop in-memory pclusters and cached compressed folios. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: free pclusters if no cached folio is attachedChunhai Guo
Once a pcluster is fully decompressed and there are no attached cached folios, its corresponding `struct z_erofs_pcluster` will be freed. This will significantly reduce the frequency of calls to erofs_shrink_scan() and the memory allocated for `struct z_erofs_pcluster`.

The tables below show approximately a 96% reduction in the calls to erofs_shrink_scan() and in the memory allocated for `struct z_erofs_pcluster` after applying this patch. The results were obtained by performing a test to copy a 4.1GB partition on ARM64 Android devices running the 6.6 kernel with an 8-core CPU and 12GB of memory.

1. The reduction in calls to erofs_shrink_scan():
+-----------------+-----------+----------+---------+
|                 | w/o patch | w/ patch |  diff   |
+-----------------+-----------+----------+---------+
| Average (times) |   11390   |   390    | -96.57% |
+-----------------+-----------+----------+---------+

2. The reduction in memory released by erofs_shrink_scan():
+-----------------+-----------+----------+---------+
|                 | w/o patch | w/ patch |  diff   |
+-----------------+-----------+----------+---------+
| Average (Byte)  | 133612656 | 4434552  | -96.68% |
+-----------------+-----------+----------+---------+

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241112043235.546164-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: sunset `struct erofs_workgroup`Gao Xiang
`struct erofs_workgroup` was introduced to provide a unique header for all physically indexed objects. However, after big pclusters and shared pclusters are implemented upstream, it seems that all EROFS encoded data (which requires transformation) can be represented with `struct z_erofs_pcluster` directly. Move all members into `struct z_erofs_pcluster` for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com
2024-11-18erofs: move erofs_workgroup operations into zdata.cGao Xiang
Move related helpers into zdata.c as an intermediate step of getting rid of `struct erofs_workgroup`, and rename:

  erofs_workgroup_put => z_erofs_put_pcluster
  erofs_workgroup_get => z_erofs_get_pcluster
  erofs_try_to_release_workgroup => erofs_try_to_release_pcluster
  erofs_shrink_workstation => z_erofs_shrink_scan

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241021035323.3280682-2-hsiangkao@linux.alibaba.com
2024-11-18erofs: get rid of erofs_{find,insert}_workgroupGao Xiang
Just fold them into the only two callers since they are simple enough. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-1-hsiangkao@linux.alibaba.com
2024-11-17smb: client: fix use-after-free of signing keyPaulo Alcantara
Customers have reported use-after-free in @ses->auth_key.response with SMB2.1 + sign mounts which occurs due to the following race:

task A                         task B
cifs_mount()
 dfs_mount_share()
  get_session()
   cifs_mount_get_session()    cifs_send_recv()
    cifs_get_smb_ses()          compound_send_recv()
     cifs_setup_session()        smb2_setup_request()
      kfree_sensitive()           smb2_calc_signature()
                                   crypto_shash_setkey() *UAF*

Fix this by ensuring that we have a valid @ses->auth_key.response by checking whether @ses->ses_status is SES_GOOD or SES_EXITING with @ses->ses_lock held. After commit 24a9799aa8ef ("smb: client: fix UAF in smb2_reconnect_server()"), we made sure to call ->logoff() only when @ses was known to be good (e.g. valid ->auth_key.response), so it's safe to access the signing key when @ses->ses_status == SES_EXITING.

Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17smb: client: Use str_yes_no() helper functionThorsten Blum
Remove hard-coded strings by using the str_yes_no() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17smb: client: memcpy() with surrounding object base addressKees Cook
Like commit f1f047bd7ce0 ("smb: client: Fix -Wstringop-overflow issues"), adjust the memcpy() destination address to be based off the surrounding object rather than based off the 4-byte "Protocol" member. This avoids a build-time warning when compiling under CONFIG_FORTIFY_SOURCE with GCC 15: In function 'fortify_memcpy_chk', inlined from 'CIFSSMBSetPathInfo' at ../fs/smb/client/cifssmb.c:5358:2: ../include/linux/fortify-string.h:571:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] 571 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-17cifs: Remove pre-historic unused CIFSSMBCopyDr. David Alan Gilbert
CIFSSMBCopy() is unused, remove it. It seems to have been that way pre-git; looking in a historic archive, I think it landed around May 2004 in Linus' BKrev: 40ab7591J_OgkpHW-qhzZukvAUAw9g and was unused back then. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-15dlm: fix recovery of middle conversionsAlexander Aring
In one special case, recovery is unable to reliably rebuild lock state by simply recreating lkb structs as sent from the lock holders. That case is when the lkbs include conversions between PR and CW modes.

The recovery code has always recognized this special case, but the implementation has always been broken, and would set invalid modes in recovered lkbs. Unpredictable or bogus errors could then be returned for further locking calls on these locks.

This bug has gone unnoticed for so long due to some combination of:
- applications never or infrequently converting between PR/CW
- recovery not occurring during these conversions
- if the recovery bug does occur, the caller may not notice, depending on what further locking calls are made, e.g. if the lock is simply unlocked it may go unnoticed

However, a core analysis from a recent gfs2 bug report points to this broken code.

PR = Protected Read
CW = Concurrent Write
PR and CW are incompatible
PR and PR are compatible
CW and CW are compatible

Example 1

node C, resource R
granted: PR node A
granted: PR node B
granted: NL node C
granted: NL node D

- A sends convert PR->CW to C
- C fails before A gets a reply
- recovery occurs

At this point, A does not know if it still holds the lock in PR, or if its conversion to CW was granted:
- If A's conversion to CW was granted, then another node's CW lock may also have been granted.
- If A's conversion to CW was not granted, it still holds a PR lock, and other nodes may also hold PR locks.

So, the new master of R cannot simply recreate the lock from A using granted mode PR and requested mode CW. The new master must look at all the recovered locks to determine the correct granted modes, and ensure that all the recovered locks are recreated in compatible states.
The correct lock recovery steps in this example are:
- node D becomes the new master of R
- node B sends D its lkb, granted PR
- node A sends D its lkb, convert PR->CW
- D determines the correct lock state is:
  granted: PR node B
  convert: PR->CW node A

The lkb sent by each node was recreated without any change on the new master node.

Example 2

node C, resource R
granted: PR node A
granted: NL node C
granted: NL node D
waiting: CW node B

- A sends convert PR->CW to C
- C grants the conversion to CW for A
- C grants the waiting request for CW to B
- C sends granted message to B, but fails before it can send the granted message to A
- B receives the granted message from C

At this point:
- A believes it is converting PR->CW
- B believes it is holding a CW lock

The correct lock recovery steps in this example are:
- node D becomes the new master of R
- node A sends D its lkb, convert PR->CW
- node B sends D its lkb, granted CW
- D determines the correct lock state is:
  granted: CW node B
  granted: CW node A

The lkb sent by B is recreated without change, but the lkb sent by A is changed because the granted mode was not compatible.

Fixes to make this work correctly:

recover_convert_waiter: should not make any changes to a converting lkb that is still waiting for a reply message. It was previously setting grmode to IV, which is an invalid state, so the lkb would not be handled correctly by other code.

receive_rcom_lock_args: was checking the wrong lkb field (wait_type instead of status) to determine if the lkb is being converted, and in need of inspection for this special recovery. It was also setting grmode to IV in the lkb, causing it to be mishandled by other code. Now, this function just puts the lkb, directly as sent, onto the convert queue of the resource being recovered, and corrects it in recover_conversion() later, if needed.

recover_conversion: the job of this function is to detect and correct lkb states for the special PR/CW conversions.
The new code now checks for recovered lkbs on the granted queue with grmode PR or CW, and takes the real grmode from that. Then it looks for lkbs on the convert queue with an incompatible grmode (i.e. grmode PR when the real grmode is CW, or v.v.) These converting lkbs need to be fixed. They are fixed by temporarily setting their grmode to NL, so that grmodes are not incompatible and won't confuse other locking code. The converting lkb will then be granted at the end of recovery, replacing the temporary NL grmode. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
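The fix-up rule described for recover_conversion can be sketched as a small standalone model. This is an illustration only, not the dlm code: it covers just the NL, PR and CW modes named in the message, and every identifier (`toy_mode`, `toy_lkb`, `toy_modes_compat()`, `toy_recover_conversion()`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy subset of lock modes: only the three named in the message. */
enum toy_mode { TOY_NL, TOY_PR, TOY_CW };
enum toy_status { TOY_GRANTED, TOY_CONVERT };

/* Hypothetical stand-in for a recovered lkb on one resource. */
struct toy_lkb {
    enum toy_status status;  /* which queue the lkb is on */
    enum toy_mode grmode;    /* granted mode */
    enum toy_mode rqmode;    /* requested mode (for conversions) */
};

/* NL is compatible with everything; PR/PR and CW/CW are compatible;
 * PR and CW are incompatible. */
static bool toy_modes_compat(enum toy_mode a, enum toy_mode b)
{
    if (a == TOY_NL || b == TOY_NL)
        return true;
    return a == b;
}

/* Returns how many converting lkbs had their grmode set to NL. */
static int toy_recover_conversion(struct toy_lkb *lkbs, size_t n)
{
    enum toy_mode real = TOY_NL;
    int fixed = 0;

    assert(lkbs != NULL || n == 0);

    /* Take the real grmode from a granted lkb holding PR or CW. */
    for (size_t i = 0; i < n; i++) {
        if (lkbs[i].status == TOY_GRANTED && lkbs[i].grmode != TOY_NL) {
            real = lkbs[i].grmode;
            break;
        }
    }
    if (real == TOY_NL)
        return 0;  /* nothing granted in PR/CW, nothing to correct */

    /* Converting lkbs with an incompatible grmode get a temporary NL. */
    for (size_t i = 0; i < n; i++) {
        if (lkbs[i].status == TOY_CONVERT &&
            !toy_modes_compat(lkbs[i].grmode, real)) {
            lkbs[i].grmode = TOY_NL;
            fixed++;
        }
    }
    return fixed;
}
```

Fed with Example 2 (A converting PR->CW, B granted CW), the model rewrites A's granted mode to NL; fed with Example 1 (B granted PR, A converting PR->CW), it changes nothing.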
2024-11-15Merge tag 'for-6.12-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix that seems urgent and good to have in 6.12 final. It could potentially lead to unexpected transaction aborts, due to wrong comparison and order of processing of delayed refs" * tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix incorrect comparison for delayed refs
2024-11-15ovl: convert ovl_real_fdget() callers to ovl_real_file()Amir Goldstein
Stop using struct fd to return a real file from ovl_real_fdget(), because we no longer return a temporary file object and the callers always get a borrowed file reference. Rename the helper to ovl_real_file(), return a borrowed reference of the real file that is referenced from the overlayfs file or an error. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: convert ovl_real_fdget_path() callers to ovl_real_file_path()Amir Goldstein
Stop using struct fd to return a real file from ovl_real_fdget_path(), because we no longer return a temporary file object and the callers always get a borrowed file reference. Rename the helper to ovl_real_file_path(), return a borrowed reference of the real file that is referenced from the overlayfs file or an error. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: store upper real file in ovl_file structAmir Goldstein
When an overlayfs file is opened as lower and then the file is copied up, every operation on the overlayfs open file will open a temporary backing file to the upper dentry and close it at the end of the operation. Store the upper real file alongside the original (lower) real file in ovl_file instead of opening a temporary upper file on every operation. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: allocate a container struct ovl_file for ovl private contextAmir Goldstein
Allocate a container struct ovl_file instead of using ->private_data to point at realfile directly, so that we can add more context per ovl open file. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: do not open non-data lower file for fsyncAmir Goldstein
ovl_fsync() with !datasync opens a backing file from the top most dentry in the stack, checks if this dentry is non-upper and skips the fsync. In case of an overlay dentry stack with lower data and lower metadata above it, but without an upper metadata above it, the backing file is opened from the top most lower metadata dentry and never used. Refactor the helper ovl_real_fdget_meta() into ovl_real_fdget_path() and open code the checks for non-upper inode in ovl_fsync(), so in that case we can avoid the unneeded backing file open. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-15ovl: Optimize override/revert credsVinicius Costa Gomes
Use override_creds_light() in ovl_override_creds() and revert_creds_light() in ovl_revert_creds(). The _light() functions do not change the 'usage' of the credentials in question, as they refer to the credentials associated with the mounter, which have a longer lifetime. In ovl_setup_cred_for_create(), do not need to modify the mounter credentials (returned by override_creds_light()) 'usage' counter. Add a warning to verify that we are indeed working with the mounter credentials (stored in the superblock). Failure in this assumption means that creds may leak. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-14ocfs2: uncache inode which has failed entering the groupDmitry Antipov
Syzbot has reported the following BUG: kernel BUG at fs/ocfs2/uptodate.c:509! ... Call Trace: <TASK> ? __die_body+0x5f/0xb0 ? die+0x9e/0xc0 ? do_trap+0x15a/0x3a0 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ? do_error_trap+0x1dc/0x2c0 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ? __pfx_do_error_trap+0x10/0x10 ? handle_invalid_op+0x34/0x40 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ? exc_invalid_op+0x38/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? ocfs2_set_new_buffer_uptodate+0x2e/0x160 ? ocfs2_set_new_buffer_uptodate+0x144/0x160 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ocfs2_group_add+0x39f/0x15a0 ? __pfx_ocfs2_group_add+0x10/0x10 ? __pfx_lock_acquire+0x10/0x10 ? mnt_get_write_access+0x68/0x2b0 ? __pfx_lock_release+0x10/0x10 ? rcu_read_lock_any_held+0xb7/0x160 ? __pfx_rcu_read_lock_any_held+0x10/0x10 ? smack_log+0x123/0x540 ? mnt_get_write_access+0x68/0x2b0 ? mnt_get_write_access+0x68/0x2b0 ? mnt_get_write_access+0x226/0x2b0 ocfs2_ioctl+0x65e/0x7d0 ? __pfx_ocfs2_ioctl+0x10/0x10 ? smack_file_ioctl+0x29e/0x3a0 ? __pfx_smack_file_ioctl+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x43d/0x780 ? __pfx_lockdep_hardirqs_on_prepare+0x10/0x10 ? __pfx_ocfs2_ioctl+0x10/0x10 __se_sys_ioctl+0xfb/0x170 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> When 'ioctl(OCFS2_IOC_GROUP_ADD, ...)' has failed for the particular inode in 'ocfs2_verify_group_and_input()', corresponding buffer head remains cached and subsequent call to the same 'ioctl()' for the same inode issues the BUG() in 'ocfs2_set_new_buffer_uptodate()' (trying to cache the same buffer head of that inode). Fix this by uncaching the buffer head with 'ocfs2_remove_from_cache()' on error path in 'ocfs2_group_add()'. 
Link: https://lkml.kernel.org/r/20241114043844.111847-1-dmantipov@yandex.ru Fixes: 7909f2bf8353 ("[PATCH 2/2] ocfs2: Implement group add for online resize") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+453873f1588c2d75b447@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=453873f1588c2d75b447 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Dmitry Antipov <dmantipov@yandex.ru> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()Dan Carpenter
The "arg->vec_len" variable is a u64 that comes from the user at the start of the function. The "arg->vec_len * sizeof(struct page_region)" multiplication can lead to integer wrapping. Use size_mul() to avoid that. Also, the size_add/mul() functions work on unsigned long, so for 32-bit systems we need to ensure that "arg->vec_len" fits in an unsigned long. Link: https://lkml.kernel.org/r/39d41335-dd4d-48ed-8a7f-402c57d8ea84@stanley.mountain Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Andrei Vagin <avagin@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
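The two checks in this message (range-check the 64-bit user value, then multiply with saturation) can be sketched in portable C. This is a userspace approximation, not the kernel patch: `toy_size_mul()` only imitates the saturating behaviour of size_mul(), and `toy_vec_bytes()` and its error convention (returning -1 when the size saturates) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating multiply: returns SIZE_MAX instead of wrapping around. */
static size_t toy_size_mul(size_t a, size_t b)
{
    if (b != 0 && a > SIZE_MAX / b)
        return SIZE_MAX;
    return a * b;
}

/* Compute vec_len * elem_size for a user-supplied 64-bit count.
 * Returns 0 and stores the byte count on success, -1 on overflow. */
static int toy_vec_bytes(uint64_t vec_len, size_t elem_size, size_t *bytes)
{
    assert(bytes != NULL);

    /* On 32-bit targets size_t is narrower than u64, so the count
     * itself may not fit; reject it before converting. */
    if (vec_len > SIZE_MAX)
        return -1;

    *bytes = toy_size_mul((size_t)vec_len, elem_size);

    /* Assumption of this sketch: a saturated result is an error. */
    return *bytes == SIZE_MAX ? -1 : 0;
}
```

A wrapped product would look like a small, valid size to later allocation or copy code, while a saturated one is easy to reject, which is the reason to prefer saturation here.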
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14Merge tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "This fixes one minor regression from the btree cache fixes (in the scan_for_btree_nodes repair path) - and the shutdown path fix is the big one here, in terms of bugs closed: - Assorted tiny syzbot fixes - Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()" The shutdown path wasn't flushing the btree write buffer, leading to shutting down while we still had operations in flight. This fixes a whole slew of syzbot bugs, and undoubtedly other strange heisenbugs. * tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs: bcachefs: Fix assertion pop in bch2_ptr_swab() bcachefs: Fix journal_entry_dev_usage_to_text() overrun bcachefs: Allow for unknown key types in backpointers fsck bcachefs: Fix assertion pop in topology repair bcachefs: Fix hidden btree errors when reading roots bcachefs: Fix validate_bset() repair path bcachefs: Fix missing validation for bch_backpointer.level bcachefs: Fix bch_member.btree_bitmap_shift validation bcachefs: bch2_btree_write_buffer_flush_going_ro()
2024-11-14statmount: retrieve security mount optionsChristian Brauner
Add the ability to retrieve security mount options. Keep them separate from filesystem specific mount options so it's easy to tell them apart. Also allow retrieving them separately from other mount options, as most of the time users won't be interested in security specific mount options. Link: https://lore.kernel.org/r/20241114-radtour-ofenrohr-ff34b567b40a@brauner Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-14 | btrfs: fix incorrect comparison for delayed refs | Josef Bacik
When I reworked delayed ref comparison in cf4f04325b2b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake and returned -1 for the case where ref1->ref_root was greater than ref2->ref_root. This is a subtle bug that can result in improper delayed ref running order, which can result in transaction aborts.

Fixes: cf4f04325b2b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
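The class of bug described above can be illustrated with a small userspace sketch of a three-way comparator. This is not the btrfs code: `struct ref_key`, `cmp_u64()` and `ref_key_cmp()` are hypothetical names, and only the two fields mentioned in the commit message are modelled.

```c
#include <stdint.h>

/* Hypothetical stand-in for the compared fields; not the btrfs struct. */
struct ref_key {
	uint64_t ref_root;
	uint64_t parent;
};

/* Three-way compare: negative if a < b, zero if equal, positive if a > b. */
static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/*
 * Compare by ref_root first, then by parent. Returning -1 for the
 * "greater than" case (the mistake the commit describes) would make the
 * ordering inconsistent, so each field goes through the same helper.
 */
int ref_key_cmp(const struct ref_key *a, const struct ref_key *b)
{
	int ret = cmp_u64(a->ref_root, b->ref_root);

	if (ret)
		return ret;
	return cmp_u64(a->parent, b->parent);
}
```

Routing every field through one helper keeps the sign convention in a single place, which makes an inverted branch harder to introduce.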
2024-11-14 | ovl: pass an explicit reference of creators creds to callers | Amir Goldstein
ovl_setup_cred_for_create() decrements one refcount of new creds and ovl_revert_creds() in callers decrements the last refcount.

In preparation to revert_creds_light() back to caller creds, pass an explicit reference of the creators creds to the callers and drop the refcount explicitly in the callers after ovl_revert_creds().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-14 | Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', ↵ | Catalin Marinas
'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core * arm64/for-next/perf: perf: Switch back to struct platform_driver::remove() perf: arm_pmuv3: Add support for Samsung Mongoose PMU dt-bindings: arm: pmu: Add Samsung Mongoose core compatible perf/dwc_pcie: Fix typos in event names perf/dwc_pcie: Add support for Ampere SoCs ARM: pmuv3: Add missing write_pmuacr() perf/marvell: Marvell PEM performance monitor support perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control perf/dwc_pcie: Convert the events with mixed case to lowercase perf/cxlpmu: Support missing events in 3.1 spec perf: imx_perf: add support for i.MX91 platform dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible drivers perf: remove unused field pmu_node * for-next/gcs: (42 commits) : arm64 Guarded Control Stack user-space support kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c arm64/gcs: Fix outdated ptrace documentation kselftest/arm64: Ensure stable names for GCS stress test results kselftest/arm64: Validate that GCS push and write permissions work kselftest/arm64: Enable GCS for the FP stress tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Add GCS signal tests kselftest/arm64: Add test coverage for GCS mode locking kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add very basic GCS test program kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Verify the GCS hwcap arm64: Add Kconfig for Guarded Control Stack (GCS) arm64/ptrace: Expose GCS via ptrace and core files arm64/signal: 
Expose GCS state in signal frames arm64/signal: Set up and restore the GCS context for signal handlers arm64/mm: Implement map_shadow_stack() ... * for-next/probes: : Various arm64 uprobes/kprobes cleanups arm64: insn: Simulate nop instruction for better uprobe performance arm64: probes: Remove probe_opcode_t arm64: probes: Cleanup kprobes endianness conversions arm64: probes: Move kprobes-specific fields arm64: probes: Fix uprobes for big-endian kernels arm64: probes: Fix simulate_ldr*_literal() arm64: probes: Remove broken LDR (literal) uprobe support * for-next/asm-offsets: : arm64 asm-offsets.c cleanup (remove unused offsets) arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE arm64: asm-offsets: remove VM_EXEC and PAGE_SZ arm64: asm-offsets: remove MM_CONTEXT_ID arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET arm64: asm-offsets: remove VMA_VM_* arm64: asm-offsets: remove TSK_ACTIVE_MM * for-next/tlb: : TLB flushing optimisations arm64: optimize flush tlb kernel range arm64: tlbflush: add __flush_tlb_range_limit_excess() * for-next/misc: : Miscellaneous patches arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled arm64/ptrace: Clarify documentation of VL configuration via ptrace acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() arm64: uprobes: Optimize cache flushes for xol slot acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block() arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings arm64/mm: Sanity check PTE address before runtime P4D/PUD folding arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont() ACPI: GTDT: Tighten the check for the array of platform timer structures arm64/fpsimd: Fix a typo arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers arm64: Return early when break handler is found on linked-list arm64/mm: Re-organize 
arch_make_huge_pte() arm64/mm: Drop _PROT_SECT_DEFAULT arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV arm64: head: Drop SWAPPER_TABLE_SHIFT arm64: cpufeature: add POE to cpucap_is_possible() arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t * for-next/mte: : Various MTE improvements selftests: arm64: add hugetlb mte tests hugetlb: arm64: add mte support * for-next/sysreg: : arm64 sysreg updates arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09 * for-next/stacktrace: : arm64 stacktrace improvements arm64: preserve pt_regs::stackframe during exec*() arm64: stacktrace: unwind exception boundaries arm64: stacktrace: split unwind_consume_stack() arm64: stacktrace: report recovered PCs arm64: stacktrace: report source of unwind data arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk() arm64: use a common struct frame_record arm64: pt_regs: swap 'unused' and 'pmr' fields arm64: pt_regs: rename "pmr_save" -> "pmr" arm64: pt_regs: remove stale big-endian layout arm64: pt_regs: assert pt_regs is a multiple of 16 bytes * for-next/hwcap3: : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4) arm64: Support AT_HWCAP3 binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4 * for-next/kselftest: (30 commits) : arm64 kselftest fixes/cleanups kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() kselftest/arm64: Corrupt P0 in the irritator when testing SSVE kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests 
kselftest/arm64: Fix build with stricter assemblers kselftest/arm64: Test signal handler state modification in fp-stress kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test kselftest/arm64: Implement irritators for ZA and ZT kselftest/arm64: Remove unused ADRs from irritator handlers kselftest/arm64: Correct misleading comments on fp-stress irritators kselftest/arm64: Poll less often while waiting for fp-stress children kselftest/arm64: Increase frequency of signal delivery in fp-stress kselftest/arm64: Fix encoding for SVE B16B16 test ... * for-next/crc32: : Optimise CRC32 using PMULL instructions arm64/crc32: Implement 4-way interleave using PMULL arm64/crc32: Reorganize bit/byte ordering macros arm64/lib: Handle CRC-32 alternative in C code * for-next/guest-cca: : Support for running Linux as a guest in Arm CCA arm64: Document Arm Confidential Compute virt: arm-cca-guest: TSM_REPORT support for realms arm64: Enable memory encrypt for Realms arm64: mm: Avoid TLBI when marking pages as valid arm64: Enforce bounce buffers for realm DMA efi: arm64: Map Device with Prot Shared arm64: rsi: Map unprotected MMIO as decrypted arm64: rsi: Add support for checking whether an MMIO is protected arm64: realm: Query IPA size from the RMM arm64: Detect if in a realm and set RIPAS RAM arm64: rsi: Add RSI definitions * for-next/haft: : Support for arm64 FEAT_HAFT arm64: pgtable: Warn unexpected pmdp_test_and_clear_young() arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG arm64: Add support for FEAT_HAFT arm64: setup: name 'tcr2' register arm64/sysreg: Update ID_AA64MMFR1_EL1 register * for-next/scs: : Dynamic shadow call stack fixes arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-14 | fs: reduce pointer chasing in is_mgtime() test | Jeff Layton
The is_mgtime test checks whether the FS_MGTIME flag is set in the fstype. To get there from the inode though, we have to dereference 3 pointers.

Add a new IOP_MGTIME flag, and have inode_init_always() set that flag when the fstype flag is set. Then, make is_mgtime test for IOP_MGTIME instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241113-mgtime-v1-1-84e256980e11@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
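The idea of caching a per-filesystem-type flag in the inode can be sketched in userspace C as follows. All `demo_`/`DEMO_` names and struct layouts are hypothetical simplifications, not the kernel definitions; the sketch only shows the three-pointer chain and the one-time copy at inode initialisation.

```c
#include <stdbool.h>

#define DEMO_FS_MGTIME  0x1u /* hypothetical per-fstype flag */
#define DEMO_IOP_MGTIME 0x1u /* hypothetical flag cached in the inode */

struct demo_fstype { unsigned int fs_flags; };
struct demo_sb     { const struct demo_fstype *s_type; };
struct demo_inode  { const struct demo_sb *i_sb; unsigned int i_opflags; };

/* Slow test: inode -> superblock -> fstype, three dependent loads. */
bool demo_is_mgtime_slow(const struct demo_inode *inode)
{
	return (inode->i_sb->s_type->fs_flags & DEMO_FS_MGTIME) != 0;
}

/* One-time initialisation copies the fstype flag into the inode. */
void demo_inode_init(struct demo_inode *inode, const struct demo_sb *sb)
{
	inode->i_sb = sb;
	inode->i_opflags = 0;
	if (sb->s_type->fs_flags & DEMO_FS_MGTIME)
		inode->i_opflags |= DEMO_IOP_MGTIME;
}

/* Fast test: a single load from the inode itself. */
bool demo_is_mgtime(const struct demo_inode *inode)
{
	return (inode->i_opflags & DEMO_IOP_MGTIME) != 0;
}
```

The trade-off is the usual one for cached flags: the copy is only valid as long as the source flag cannot change after the inode is initialised, which this sketch simply assumes.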
2024-11-14 | vfs: make evict() use smp_mb__after_spinlock instead of smp_mb | Mateusz Guzik
The barrier directly follows a spin_lock() call, so smp_mb__after_spinlock() is sufficient.

This removes an explicit barrier on x86-64.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20241113155103.4194099-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-14 | configfs: improve item creation performance | Seamus Connor
As the size of a directory increases, item creation slows down. Optimizing access to s_children removes this bottleneck.

dirents are already pinned into the cache, so there is no need to scan the s_children list looking for duplicate items. The configfs_dirent_exists check is moved to a location where it is called only during subsystem initialization.

d_lookup will only need to call configfs_lookup in the case where the item in question is not pinned to dcache. The only items not pinned to dcache are attributes. These are placed at the front of the s_children list, whilst pinned items are inserted at the back. configfs_lookup stops scanning when it encounters the first pinned entry in s_children.

The assumption of the above optimizations is that there will be few attributes, but potentially many items in a given directory.

Signed-off-by: Seamus Connor <sconnor@purestorage.com>
Reviewed-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
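The list layout described above (unpinned attributes at the front, pinned items at the back, lookup stopping at the first pinned entry) can be sketched with a plain singly linked list. This is a userspace illustration only: `demo_dirent`, `demo_dir`, `demo_add()` and `demo_lookup()` are hypothetical names, and the real code uses the kernel's list primitives and the dcache.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo_dirent {
	const char *name;
	bool pinned;              /* true for items, false for attributes */
	struct demo_dirent *next;
};

struct demo_dir {
	struct demo_dirent *head;
	struct demo_dirent *tail;
};

/* Unpinned entries go to the front, pinned entries to the back. */
void demo_add(struct demo_dir *dir, struct demo_dirent *d)
{
	d->next = NULL;
	if (!dir->head) {
		dir->head = dir->tail = d;
	} else if (d->pinned) {
		dir->tail->next = d;
		dir->tail = d;
	} else {
		d->next = dir->head;
		dir->head = d;
	}
}

/*
 * Scan only the unpinned prefix. Pinned entries are assumed to be
 * found through a separate cache, so the scan stops at the first one.
 */
struct demo_dirent *demo_lookup(const struct demo_dir *dir, const char *name)
{
	for (struct demo_dirent *d = dir->head; d && !d->pinned; d = d->next)
		if (strcmp(d->name, name) == 0)
			return d;
	return NULL;
}
```

With few attributes and many items, the scan length depends on the attribute count rather than on the directory size, which is the assumption the commit message states.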
2024-11-14 | configfs: remove unused configfs_hash_and_remove | Dr. David Alan Gilbert
configfs_hash_and_remove() has been unused since it was added in 2005 by commit 7063fbf22611 ("[PATCH] configfs: User-driven configuration filesystem")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-11-13 | btrfs: validate queue limits | Christoph Hellwig
Call blk_validate_limits on the queue limits used for zone append splitting so that calculated values get filled in and any stacking conflicts get caught. Without this there isn't a max_zone_append_sectors limit as of commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors").

Fixes: 559218d43ec9 ("block: pre-calculate max_zone_append_sectors")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 | jbd2: Fix comment describing journal_init_common() | Daniel Martín Gómez
The code indicates that journal_init_common() fills the journal_t object it returns while the comment incorrectly states that only a few fields are initialised. Also, the comment claims that journal structures could be created from scratch which isn't possible as journal_init_common() calls journal_load_superblock() which loads and checks journal superblock from disk.

Signed-off-by: Daniel Martín Gómez <dalme@riseup.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20241107144538.3544-1-dalme@riseup.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13 | ext4: prevent an infinite loop in the lazyinit thread | Mathieu Othacehe
Use ktime_get_ns instead of ktime_get_real_ns when computing the lr_timeout so that it is not affected by system time jumps. Use a boolean instead of the MAX_JIFFY_OFFSET value to determine whether the next_wakeup value has been set. Comparing elr->lr_next_sched to MAX_JIFFY_OFFSET can cause the lazyinit thread to loop indefinitely.

Co-developed-by: Lukas Skupinski <lukas.skupinski@landisgyr.com>
Signed-off-by: Lukas Skupinski <lukas.skupinski@landisgyr.com>
Signed-off-by: Mathieu Othacehe <othacehe@gnu.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241106134741.26948-2-othacehe@gnu.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13 | ext4: use struct_size() to improve ext4_htree_store_dirent() | Thorsten Blum
Inline and use struct_size() to calculate the number of bytes to allocate for new_fn and remove the local variable len.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20241105103353.11590-2-thorsten.blum@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
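For readers unfamiliar with the helper, the following userspace sketch approximates what a struct_size()-style calculation provides for a struct with a flexible array member: header size plus element count, saturating instead of wrapping on overflow. `struct demo_fname`, `demo_fname_size()` and `demo_fname_new()` are hypothetical and only loosely modelled on the description above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct with a flexible array member. */
struct demo_fname {
	uint32_t hash;
	uint8_t name_len;
	char name[];
};

/* Header plus `count` name bytes, saturating to SIZE_MAX on overflow. */
static size_t demo_fname_size(size_t count)
{
	size_t base = sizeof(struct demo_fname);

	if (count > SIZE_MAX - base)
		return SIZE_MAX;
	return base + count;
}

/* Allocate an entry with room for the name and its trailing NUL. */
struct demo_fname *demo_fname_new(uint32_t hash, const char *name, uint8_t len)
{
	struct demo_fname *fn = malloc(demo_fname_size((size_t)len + 1));

	if (!fn)
		return NULL;
	fn->hash = hash;
	fn->name_len = len;
	memcpy(fn->name, name, len);
	fn->name[len] = '\0';
	return fn;
}
```

Saturating to `SIZE_MAX` means an overflowing request turns into an allocation that fails, rather than a too-small buffer that succeeds.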
2024-11-13 | ext4: annotate struct fname with __counted_by() | Thorsten Blum
Add the __counted_by compiler attribute to the flexible array member name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20241105101813.10864-2-thorsten.blum@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13 | ext4: use str_yes_no() helper function | Thorsten Blum
Remove hard-coded strings by using the str_yes_no() helper function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20241021100056.5521-2-thorsten.blum@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13 | fsnotify: fix sending inotify event with unexpected filename | Amir Goldstein
We got a report that adding a fanotify filesystem watch prevents tail -f from receiving events.

Reproducer:

1. Create 3 windows / login sessions. Become root in each session.
2. Choose a mounted filesystem that is pretty quiet; I picked /boot.
3. In the first window, run: fsnotifywait -S -m /boot
4. In the second window, run: echo data >> /boot/foo
5. In the third window, run: tail -f /boot/foo
6. Go back to the second window and run: echo more data >> /boot/foo
7. Observe that the tail command doesn't show the new data.
8. In the first window, hit control-C to interrupt fsnotifywait.
9. In the second window, run: echo still more data >> /boot/foo
10. Observe that the tail command in the third window has now printed the missing data.

When stracing tail, we observed that when fanotify filesystem mark is set, tail does get the inotify event, but the event is received with the filename:

  read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\20\0\0\0foo\0\0\0\0\0\0\0\0\0\0\0\0\0", 50) = 32

This is unexpected, because tail is watching the file itself and not its parent and is inconsistent with the inotify event received by tail when fanotify filesystem mark is not set:

  read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 50) = 16

The interference between different fsnotify groups was caused by the fact that the mark on the sb requires the filename, so the filename is passed to fsnotify(). Later on, fsnotify_handle_event() tries to take care of not passing the filename to groups (such as inotify) that are interested in the filename only when the parent is watching.

But the logic was incorrect for the case that no group is watching the parent, some groups are watching the sb and some watching the inode.

Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: 7372e79c9eb9 ("fanotify: fix logic of reporting name info with watched parent")
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
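The two read() results quoted above can be decoded by hand against the inotify event layout (wd, mask, cookie, len, then len bytes of name). The sketch below does that portably, without including the inotify header; it assumes the bytes come from a little-endian host, and `struct demo_event`, `le32()` and `demo_decode()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HDR_LEN 16u /* wd + mask + cookie + len, 4 bytes each */

struct demo_event {
	int32_t wd;
	uint32_t mask;
	uint32_t cookie;
	uint32_t len;     /* size of the padded name field, may be 0 */
	const char *name; /* NULL when len == 0 */
};

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode one event; returns bytes consumed, or 0 if buf is too short. */
size_t demo_decode(const unsigned char *buf, size_t n, struct demo_event *ev)
{
	if (n < DEMO_HDR_LEN)
		return 0;
	ev->wd = (int32_t)le32(buf);
	ev->mask = le32(buf + 4);
	ev->cookie = le32(buf + 8);
	ev->len = le32(buf + 12);
	if (ev->len > n - DEMO_HDR_LEN)
		return 0;
	ev->name = ev->len ? (const char *)buf + DEMO_HDR_LEN : NULL;
	return DEMO_HDR_LEN + ev->len;
}
```

Applied to the first buffer this yields wd 1, mask 2, a 16-byte name field holding "foo", and 32 bytes consumed; the second buffer decodes to the same wd and mask with an empty name field and 16 bytes consumed, matching the two read() return values.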
2024-11-13 | Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:

"10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All singletons"

* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: swapfile: fix cluster reclaim work crash on rotational devices
  selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
  mm/thp: fix deferred split queue not partially_mapped: fix
  mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
  nommu: pass NULL argument to vma_iter_prealloc()
  ocfs2: fix UBSAN warning in ocfs2_verify_volume()
  nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
  nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
  mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  mm: count zeromap read and set for swapout and swapin
2024-11-13 | libfs: kill empty_dir_getattr() | Al Viro
It's used only to initialize ->getattr in one inode_operations instance (empty_dir_inode_operations) and its behaviour had always been equivalent to what we get with NULL ->getattr.

Just remove that initializer, along with empty_dir_getattr() itself. While we are at it, the same instance has ->permission initialized to generic_permission, which is what NULL ->permission ends up doing. Again, no point keeping it.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-13 | fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag | Stefan Berger
Commit 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function") introduced the AT_GETATTR_NOSEC flag to ensure that the call paths only call vfs_getattr_nosec if it is set instead of vfs_getattr. Now, simplify the getattr interface functions of filesystems where the flag AT_GETATTR_NOSEC is checked.

There is only a single caller of the inode_operations getattr function and it is located in fs/stat.c in vfs_getattr_nosec. That caller is the only one that passes the AT_GETATTR_NOSEC flag.

Two filesystems are checking this flag in .getattr and the flag is always passed to them unconditionally from only vfs_getattr_nosec:

- ecryptfs: Simplify by always calling vfs_getattr_nosec in ecryptfs_getattr. From there the flag is passed to no other function and this function is not called otherwise.

- overlayfs: Simplify by always calling vfs_getattr_nosec in ovl_getattr. From there the flag is passed to no other function and this function is not called otherwise.

The query_flags in vfs_getattr_nosec will mask-out AT_GETATTR_NOSEC from any caller using AT_STATX_SYNC_TYPE as mask so that the flag is not important inside this function. Also, since no filesystem is checking the flag anymore, remove the flag entirely now, including the BUG_ON check that never triggered.

The net effect of these changes combined with the original commit is that ecryptfs and overlayfs do not call vfs_getattr but only vfs_getattr_nosec.

Fixes: 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Closes: https://lore.kernel.org/linux-fsdevel/20241101011724.GN1350452@ZenIV/T/#u
Cc: Tyler Hicks <code@tyhicks.com>
Cc: ecryptfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-unionfs@vger.kernel.org
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-13 | fs/stat.c: switch to CLASS(fd_raw) | Al Viro
... and use fd_empty() consistently

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>