summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-11-04NFS: Fix attribute delegation behaviour on exclusive createTrond Myklebust
When the client does an exclusive create and the server decides to store the verifier in the timestamps, a SETATTR is subsequently sent to fix up those timestamps. When that is the case, suppress the exceptions for attribute delegations in nfs4_bitmap_copy_adjust(). Fixes: 32215c1f893a ("NFSv4: Don't request atime/mtime/size if they are delegated to us") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04nfs: Fix KMSAN warning in decode_getfattr_attrs()Roberto Sassu
Fix the following KMSAN warning: CPU: 1 UID: 0 PID: 7651 Comm: cp Tainted: G B Tainted: [B]=BAD_PAGE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) ===================================================== ===================================================== BUG: KMSAN: uninit-value in decode_getfattr_attrs+0x2d6d/0x2f90 decode_getfattr_attrs+0x2d6d/0x2f90 decode_getfattr_generic+0x806/0xb00 nfs4_xdr_dec_getattr+0x1de/0x240 rpcauth_unwrap_resp_decode+0xab/0x100 rpcauth_unwrap_resp+0x95/0xc0 call_decode+0x4ff/0xb50 __rpc_execute+0x57b/0x19d0 rpc_execute+0x368/0x5e0 rpc_run_task+0xcfe/0xee0 nfs4_proc_getattr+0x5b5/0x990 __nfs_revalidate_inode+0x477/0xd00 nfs_access_get_cached+0x1021/0x1cc0 nfs_do_access+0x9f/0xae0 nfs_permission+0x1e4/0x8c0 inode_permission+0x356/0x6c0 link_path_walk+0x958/0x1330 path_lookupat+0xce/0x6b0 filename_lookup+0x23e/0x770 vfs_statx+0xe7/0x970 vfs_fstatat+0x1f2/0x2c0 __se_sys_newfstatat+0x67/0x880 __x64_sys_newfstatat+0xbd/0x120 x64_sys_call+0x1826/0x3cf0 do_syscall_64+0xd0/0x1b0 entry_SYSCALL_64_after_hwframe+0x77/0x7f The KMSAN warning is triggered in decode_getfattr_attrs(), when calling decode_attr_mdsthreshold(). It appears that fattr->mdsthreshold is not initialized. Fix the issue by initializing fattr->mdsthreshold to NULL in nfs_fattr_init(). Cc: stable@vger.kernel.org # v3.5.x Fixes: 88034c3d88c2 ("NFSv4.1 mdsthreshold attribute xdr") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04NFSv3: only use NFS timeout for MOUNT when protocols are compatibleNeilBrown
If a timeout is specified in the mount options, it currently applies to both the NFS protocol and (with v3) the MOUNT protocol. This is sensible when they both use the same underlying protocol, or those protocols are compatible w.r.t timeouts as RDMA and TCP are. However if, for example, NFS is using TCP and MOUNT is using UDP then using the same timeout doesn't make much sense. If you mount -o vers=3,proto=tcp,mountproto=udp,timeo=600,retrans=5 \ server:/path /mountpoint then the timeo=600 which was intended for the NFS/TCP request will apply to the MOUNT/UDP requests with the result that there will only be one request sent (because UDP has a maximum timeout of 60 seconds). This is not what a reasonable person might expect. This patch disables the sharing of timeout information in cases where the underlying protocols are not compatible. Fixes: c9301cb35b59 ("nfs: hornor timeo and retrans option when mounting NFSv3") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-03smb: client: Fix use-after-free of network namespace.Kuniyuki Iwashima
Recently, we got a customer report that CIFS triggers oops while reconnecting to a server. [0] The workload runs on Kubernetes, and some pods mount CIFS servers in non-root network namespaces. The problem rarely happened, but it was always while the pod was dying. The root cause is wrong reference counting for network namespace. CIFS uses kernel sockets, which do not hold refcnt of the netns that the socket belongs to. That means CIFS must ensure the socket is always freed before its netns; otherwise, use-after-free happens. The repro steps are roughly: 1. mount CIFS in a non-root netns 2. drop packets from the netns 3. destroy the netns 4. unmount CIFS We can reproduce the issue quickly with the script [1] below and see the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled. When the socket is TCP, it is hard to guarantee the netns lifetime without holding refcnt due to async timers. Let's hold netns refcnt for each socket as done for SMC in commit 9744d2bf1976 ("smc: Fix use-after-free in tcp_write_timer_handler()."). Note that we need to move put_net() from cifs_put_tcp_session() to clean_demultiplex_info(); otherwise, __sock_create() still could touch a freed netns while cifsd tries to reconnect from cifs_demultiplex_thread(). Also, maybe_get_net() cannot be put just before __sock_create() because the code is not under RCU and there is a small chance that the same address happened to be reallocated to another netns. [0]: CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting... CIFS: Serverclose failed 4 times, giving up Unable to handle kernel paging request at virtual address 14de99e461f84a07 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [14de99e461f84a07] address between user and kernel address ranges Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1 Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fib_rules_lookup+0x44/0x238 lr : __fib_lookup+0x64/0xbc sp : ffff8000265db790 x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01 x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580 x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500 x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002 x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294 x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000 x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0 x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500 Call trace: fib_rules_lookup+0x44/0x238 __fib_lookup+0x64/0xbc ip_route_output_key_hash_rcu+0x2c4/0x398 ip_route_output_key_hash+0x60/0x8c tcp_v4_connect+0x290/0x488 __inet_stream_connect+0x108/0x3d0 inet_stream_connect+0x50/0x78 kernel_connect+0x6c/0xac generic_ip_connect+0x10c/0x6c8 [cifs] __reconnect_target_unlocked+0xa0/0x214 [cifs] reconnect_dfs_server+0x144/0x460 [cifs] cifs_reconnect+0x88/0x148 [cifs] cifs_readv_from_socket+0x230/0x430 [cifs] cifs_read_from_socket+0x74/0xa8 [cifs] cifs_demultiplex_thread+0xf8/0x704 [cifs] kthread+0xd0/0xd4 Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264) [1]: CIFS_CRED="/root/cred.cifs" CIFS_USER="Administrator" CIFS_PASS="Password" CIFS_IP="X.X.X.X" CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST" CIFS_MNT="/mnt/smb" DEV="enp0s3" cat <<EOF > ${CIFS_CRED} username=${CIFS_USER} password=${CIFS_PASS} domain=EXAMPLE.COM EOF unshare -n bash -c " mkdir -p ${CIFS_MNT} ip netns attach root 1 ip link add eth0 type veth peer veth0 netns root ip link set eth0 up ip -n root link set veth0 up ip addr add 192.168.0.2/24 dev eth0 ip -n root addr add 192.168.0.1/24 dev veth0 ip route add default via 192.168.0.1 dev eth0 ip netns exec root sysctl net.ipv4.ip_forward=1 ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1 touch ${CIFS_MNT}/a.txt ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE " umount ${CIFS_MNT} [2]: ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1576) generic_ip_connect (fs/smb/client/connect.c:3075) cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798) cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366) dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285) cifs_mount (fs/smb/client/connect.c:3622) cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949) smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794) vfs_get_tree (fs/super.c:1800) path_mount (fs/namespace.c:3508 fs/namespace.c:3834) __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-03Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM. The usual collection of singletons - please see the changelogs" * tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes mm: shrinker: avoid memleak in alloc_shrinker_info .mailmap: update e-mail address for Eugen Hristev vmscan,migrate: fix page count imbalance on node stats when demoting pages mailmap: update Jarkko's email addresses mm: allow set/clear page_type again nilfs2: fix potential deadlock with newly created symlinks Squashfs: fix variable overflow in squashfs_readpage_block kasan: remove vmalloc_percpu test tools/mm: -Werror fixes in page-types/slabinfo mm, swap: avoid over reclaim of full clusters mm: fix PSWPIN counter for large folios swap-in mm: avoid VM_BUG_ON when try to map an anon large folio to zero page. mm/codetag: fix null pointer check logic for ref and tag mm/gup: stop leaking pinned pages in low memory conditions
2024-11-03xattr: switch to CLASS(fd)Al Viro
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03do_pollfd(): convert to CLASS(fd)Al Viro
lift setting ->revents into the caller, so that failure exits (including the early one) would be plain returns. We need the scope of our struct fd to end before the store to ->revents, since that's shared with the failure exits prior to the point where we can do fdget(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03convert do_select()Al Viro
take the logics from fdget() to fdput() into an inlined helper - with existing wait_key_set() subsumed into that. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03convert vfs_dedupe_file_range().Al Viro
fdput() is followed by checking fatal_signal_pending() (and aborting the loop in such case). fdput() is transposable with that check. Yes, it'll probably end up with slightly fatter code (call after the check has returned false + call on the almost never taken out-of-line path instead of one call before the check), but it's not worth bothering with explicit extra scope there (or dragging the check into the loop condition, for that matter). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03convert cifs_ioctl_copychunk()Al Viro
fdput() moved past mnt_drop_file_write(); harmless, if somewhat cringeworthy. Reordering could be avoided either by adding an explicit scope or by making mnt_drop_file_write() called via __cleanup. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03convert do_preadv()/do_pwritev()Al Viro
fdput() can be transposed with add_{r,w}char() and inc_sysc{r,w}(); it's the same story as with do_readv()/do_writev(), only with fdput() instead of fdput_pos(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03fdget(), more trivial conversionsAl Viro
all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03fdget(), trivial conversionsAl Viro
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03o2hb_region_dev_store(): avoid goto around fdget()/fdput()Al Viro
Preparation for CLASS(fd) conversion. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03introduce "fd_pos" class, convert fdget_pos() users to it.Al Viro
fdget_pos() for constructor, fdput_pos() for cleanup, all users of fd..._pos() converted trivially. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03fdget_raw() users: switch to CLASS(fd_raw)Al Viro
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03convert vmsplice() to CLASS(fd)Al Viro
Irregularity here is fdput() not in the same scope as fdget(); we could just lift it out vmsplice_type() in vmsplice(2), but there's no much point keeping vmsplice_type() separate after that... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03simplify xfs_find_handle() a bitAl Viro
XFS_IOC_FD_TO_HANDLE can grab a reference to copied ->f_path and let the file go; results in simpler control flow - cleanup is the same for both "by descriptor" and "by pathname" cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03timerfd: switch to CLASS(fd)Al Viro
Fold timerfd_fget() into both callers to have fdget() and fdput() in the same scope. Could be done in different ways, but this is probably the smallest solution. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03regularize emptiness checks in fini_module(2) and vfs_dedupe_file_range()Al Viro
With few exceptions emptiness checks are done as fd_file(...) in boolean context (usually something like if (!fd_file(f))...); those will be taken care of later. However, there's a couple of places where we do those checks as 'store fd_file(...) into a variable, then check if this variable is NULL' and those are harder to spot. Get rid of those now. use fd_empty() instead of extracting file and then checking it for NULL. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-02Merge tag 'nfsd-6.12-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix two async COPY bugs found during NFS bake-a-thon - Fix an svcrdma memory leak * tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: rpcrdma: Always release the rpcrdma_device's xa_array NFSD: Never decrement pending_async_copies on error NFSD: Initialize struct nfsd4_copy earlier
2024-11-02Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Carlos Maiolino: - fix a sysbot reported crash on filestreams - Reduce cpu time spent searching for extents in a very fragmented FS - Check for delayed allocations before setting extsize * tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: streamline xfs_filestream_pick_ag xfs: fix finding a last resort AG in xfs_filestream_pick_ag xfs: Reduce unnecessary searches when searching for the best extents xfs: Check for delayed allocations before setting extsize
2024-11-01Merge tag 'vfs-6.12-rc6.iomap' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iomap fixes from Christian Brauner: "Fixes for iomap to prevent data corruption bugs in the fallocate unshare range implementation of fsdax and a small cleanup to turn iomap_want_unshare_iter() into an inline function" * tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iomap: turn iomap_want_unshare_iter into an inline function fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole
2024-11-01Merge tag 'vfs-6.12-rc6.fixes' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull filesystem fixes from Christian Brauner: "VFS: - Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set - Add a get_tree_bdev_flags() helper that allows to modify e.g., whether errors are logged into the filesystem context during superblock creation. This is used by erofs to fix a userspace regression where an error is currently logged when its used on a regular file which is an new allowed mode in erofs. netfs: - Fix the sysfs debug path in the documentation. - Fix iov_iter_get_pages*() for folio queues by skipping the page extracation if we're at the end of a folio. afs: - Fix moving subdirectories to different parent directory. autofs: - Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in validate_dev_ioctl(). The actual ioctl number, not the ioctl command needs to be checked for autofs" * tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP autofs: fix thinko in validate_dev_ioctl() iov_iter: Fix iov_iter_get_pages*() for folio_queue afs: Fix missing subdir edit when renamed between parent dirs doc: correcting the debug path for cachefiles erofs: use get_tree_bdev_flags() to avoid misleading messages fs/super.c: introduce get_tree_bdev_flags()
2024-11-01Merge tag 'for-6.12-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more stability fixes. There's one patch adding export of MIPS cmpxchg helper, used in the error propagation fix. - fix error propagation from split bios to the original btrfs bio - fix merging of adjacent extents (normal operation, defragmentation) - fix potential use after free after freeing btrfs device structures" * tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix defrag not merging contiguous extents due to merged extent maps btrfs: fix extent map merging not happening for adjacent extents btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids() btrfs: fix error propagation of split bios MIPS: export __cmpxchg_small()
2024-11-01Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "Various syzbot fixes, and the more notable ones: - Fix for pointers in an extent overflowing the max (16) on a filesystem with many devices: we were creating too many cached copies when moving data around. Now, we only create at most one cached copy if there's a promote target set. Caching will be a bit broken for reflinked data until 6.13: I have larger series queued up which significantly improves the plumbing for data options down into the extent (bch_extent_rebalance) to fix this. - Fix for deadlock on -ENOSPC on tiny filesystems Allocation from the partial open_bucket list wasn't correctly accounting partial open_buckets as free: this fixes the main cause of tests timing out in the automated tests" * tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs: bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get() bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets bcachefs: Don't filter partial list buckets in open_buckets_to_text() bcachefs: Don't keep tons of cached pointers around bcachefs: init freespace inited bits to 0 in bch2_fs_initialize bcachefs: Fix unhandled transaction restart in fallocate bcachefs: Fix UAF in bch2_reconstruct_alloc() bcachefs: fix null-ptr-deref in have_stripes() bcachefs: fix shift oob in alloc_lru_idx_fragmentation bcachefs: Fix invalid shift in validate_sb_layout()
2024-11-01fs: optimize acl_permission_check()Linus Torvalds
generic_permission() turned out to be costlier than expected. The reason is that acl_permission_check() performs owner checks that involve costly pointer dereferences. There's already code that skips expensive group checks if the group and other permission bits are the same wrt to the mask that is checked. This logic can be extended to also shortcut permission checking in acl_permission_check(). If the inode doesn't have POSIX ACLs than ownership doesn't matter. If there are no missing UGO permissions the permission check can be shortcut. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/CAHk-=wgXEoAOFRkDg+grxs+p1U+QjWXLixRGmYEfd=vG+OBuFw@mail.gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-01tracing: Fix tracefs mount optionsKalesh Singh
Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API") converted tracefs to use the new mount APIs caused mount options (e.g. gid=<gid>) to not take effect. The tracefs superblock can be updated from multiple paths: - on fs_initcall() to init_trace_printk_function_export() - from a work queue to initialize eventfs tracer_init_tracefs_work_func() - fsconfig() syscall to mount or remount of tracefs The tracefs superblock root inode gets created early on in init_trace_printk_function_export(). With the new mount API, tracefs effectively uses get_tree_single() instead of the old API mount_single(). Previously, mount_single() ensured that the options are always applied to the superblock root inode: (1) If the root inode didn't exist, call fill_super() to create it and apply the options. (2) If the root inode exists, call reconfigure_single() which effectively calls tracefs_apply_options() to parse and apply options to the subperblock's fs_info and inode and remount eventfs (if necessary) On the other hand, get_tree_single() effectively calls vfs_get_super() which: (3) If the root inode doesn't exists, calls fill_super() to create it and apply the options. (4) If the root inode already exists, updates the fs_context root with the superblock's root inode. (4) above is always the case for tracefs mounts, since the super block's root inode will already be created by init_trace_printk_function_export(). This means that the mount options get ignored: - Since it isn't applied to the superblock's root inode, it doesn't get inherited by the children. - Since eventfs is initialized from a separate work queue and before call to mount with the options, and it doesn't get remounted for mount. Ensure that the mount options are applied to the super block and eventfs is remounted to respect the mount options. To understand this better, if fstab has the following: tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0 On boot up, permissions look like: # ls -l /sys/kernel/tracing/trace -rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace When it should look like: # ls -l /sys/kernel/tracing/trace -rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/ Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ali Zahraee <ahzahraee@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 78ff64081949 ("vfs: Convert tracefs to use the new mount API") Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01fs/ntfs3: Accumulated refactoring changesKonstantin Komarov
Changes made to improve readability and debuggability. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01fs/ntfs3: Switch to folio to release resourcesKonstantin Komarov
As part of the process of switching from page to folio. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01fs/ntfs3: Add check in ntfs_extend_initialized_sizeKonstantin Komarov
Check arguments again after lock. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01fs/ntfs3: Add more checks in mi_enum_attr (part 2)Konstantin Komarov
Add offset check before access to attr->non_res field as mentioned in [1]. [1] https://lore.kernel.org/ntfs3/20241010110005.42792-1-llfamsec@gmail.com/ Suggested-by: lei lu <llfamsec@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01fs/ntfs3: Equivalent transition from page to folioKonstantin Komarov
If using the proposed function folio_zero_range(), should one switch from 'start + end' to 'start + length,' or use folio_zero_segment() Fixes: 1da86618bdce ("fs: Convert aops->write_begin to take a folio") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01fs/ntfs3: Fix case when unmarked clusters intersect with zoneKonstantin Komarov
Reported-by: syzbot+7f3761b790fa41d0f3d5@syzkaller.appspotmail.com Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01fs/ntfs3: Fix warning in ni_fiemapKonstantin Komarov
Use local runs_tree instead of cached. This way excludes rw_semaphore lock. Reported-by: syzbot+1c25748a40fe79b8a119@syzkaller.appspotmail.com Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-11-01f2fs: remove redundant atomic file check in defragmentZhiguo Niu
f2fs_is_atomic_file(inode) is checked in f2fs_defragment_range, so remove the redundant checking in f2fs_ioc_defragment. Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: fix to convert log type to segment data type correctlyChao Yu
This patch introduces a new helper log_type_to_seg_type() to convert log type to segment data type, and uses it to clean up opened codes in build_curseg(), and also it fixes to convert log type before use in do_write_page(). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: clean up the unused variable additional_reserved_segmentsLongPing Wei
additional_reserved_segments was introduced by commit 300a842937fb ("f2fs: fix to reserve space for IO align feature"), and its initialization was deleted by commit 87161a2b0aed ("f2fs: deprecate io_bits"). Signed-off-by: LongPing Wei <weilongping@oppo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: clean up opened code w/ {get,set}_nid()Chao Yu
Just cleanup, no logic change. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: fix to avoid use GC_AT when setting gc_mode as GC_URGENT_LOW or ↵Zhiguo Niu
GC_URGENT_MID If gc_mode is set to GC_URGENT_LOW or GC_URGENT_MID, cost benefit GC approach should be used, but if ATGC is enabled at the same time, Age-threshold approach will be selected, which can only do amount of GC and it is much less than the numbers of CB approach. some traces: f2fs_gc-254:48-396 [007] ..... 2311600.684028: f2fs_gc_begin: dev = (254,48), gc_type = Background GC, no_background_GC = 0, nr_free_secs = 0, nodes = 1053, dents = 2, imeta = 18, free_sec:44898, free_seg:44898, rsv_seg:239, prefree_seg:0 f2fs_gc-254:48-396 [007] ..... 2311600.684527: f2fs_get_victim: dev = (254,48), type = No TYPE, policy = (Background GC, LFS-mode, Age-threshold), victim = 10, cost = 4294364975, ofs_unit = 1, pre_victim_secno = -1, prefree = 0, free = 44898 f2fs_gc-254:48-396 [007] ..... 2311600.714835: f2fs_gc_end: dev = (254,48), ret = 0, seg_freed = 0, sec_freed = 0, nodes = 1562, dents = 2, imeta = 18, free_sec:44898, free_seg:44898, rsv_seg:239, prefree_seg:0 f2fs_gc-254:48-396 [007] ..... 2311600.714843: f2fs_background_gc: dev = (254,48), wait_ms = 50, prefree = 0, free = 44898 f2fs_gc-254:48-396 [007] ..... 2311600.771785: f2fs_gc_begin: dev = (254,48), gc_type = Background GC, no_background_GC = 0, nr_free_secs = 0, nodes = 1562, dents = 2, imeta = 18, free_sec:44898, free_seg:44898, rsv_seg:239, prefree_seg: f2fs_gc-254:48-396 [007] ..... 2311600.772275: f2fs_gc_end: dev = (254,48), ret = -61, seg_freed = 0, sec_freed = 0, nodes = 1562, dents = 2, imeta = 18, free_sec:44898, free_seg:44898, rsv_seg:239, prefree_seg:0 Fixes: 0e5e81114de1 ("f2fs: add GC_URGENT_LOW mode in gc_urgent") Fixes: d98af5f45520 ("f2fs: introduce gc_urgent_mid mode") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: fix to parse temperature correctly in f2fs_get_segment_temp()Chao Yu
In __get_segment_type(), __get_segment_type_6() may return CURSEG_COLD_DATA_PINNED or CURSEG_ALL_DATA_ATGC log type, but following f2fs_get_segment_temp() can only handle persistent log type, fix it. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: fix to avoid potential deadlock in f2fs_record_stop_reason()Chao Yu
syzbot reports deadlock issue of f2fs as below: ====================================================== WARNING: possible circular locking dependency detected 6.12.0-rc3-syzkaller-00087-gc964ced77262 #0 Not tainted ------------------------------------------------------ kswapd0/79 is trying to acquire lock: ffff888011824088 (&sbi->sb_lock){++++}-{3:3}, at: f2fs_down_write fs/f2fs/f2fs.h:2199 [inline] ffff888011824088 (&sbi->sb_lock){++++}-{3:3}, at: f2fs_record_stop_reason+0x52/0x1d0 fs/f2fs/super.c:4068 but task is already holding lock: ffff88804bd92610 (sb_internal#2){.+.+}-{0:0}, at: f2fs_evict_inode+0x662/0x15c0 fs/f2fs/inode.c:842 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (sb_internal#2){.+.+}-{0:0}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1716 [inline] sb_start_intwrite+0x4d/0x1c0 include/linux/fs.h:1899 f2fs_evict_inode+0x662/0x15c0 fs/f2fs/inode.c:842 evict+0x4e8/0x9b0 fs/inode.c:725 f2fs_evict_inode+0x1a4/0x15c0 fs/f2fs/inode.c:807 evict+0x4e8/0x9b0 fs/inode.c:725 dispose_list fs/inode.c:774 [inline] prune_icache_sb+0x239/0x2f0 fs/inode.c:963 super_cache_scan+0x38c/0x4b0 fs/super.c:223 do_shrink_slab+0x701/0x1160 mm/shrinker.c:435 shrink_slab+0x1093/0x14d0 mm/shrinker.c:662 shrink_one+0x43b/0x850 mm/vmscan.c:4818 shrink_many mm/vmscan.c:4879 [inline] lru_gen_shrink_node mm/vmscan.c:4957 [inline] shrink_node+0x3799/0x3de0 mm/vmscan.c:5937 kswapd_shrink_node mm/vmscan.c:6765 [inline] balance_pgdat mm/vmscan.c:6957 [inline] kswapd+0x1ca3/0x3700 mm/vmscan.c:7226 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #1 (fs_reclaim){+.+.}-{0:0}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825 __fs_reclaim_acquire mm/page_alloc.c:3834 [inline] fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3848 might_alloc include/linux/sched/mm.h:318 [inline] prepare_alloc_pages+0x147/0x5b0 mm/page_alloc.c:4493 __alloc_pages_noprof+0x16f/0x710 mm/page_alloc.c:4722 alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 alloc_pages_noprof mm/mempolicy.c:2345 [inline] folio_alloc_noprof+0x128/0x180 mm/mempolicy.c:2352 filemap_alloc_folio_noprof+0xdf/0x500 mm/filemap.c:1010 do_read_cache_folio+0x2eb/0x850 mm/filemap.c:3787 read_mapping_folio include/linux/pagemap.h:1011 [inline] f2fs_commit_super+0x3c0/0x7d0 fs/f2fs/super.c:4032 f2fs_record_stop_reason+0x13b/0x1d0 fs/f2fs/super.c:4079 f2fs_handle_critical_error+0x2ac/0x5c0 fs/f2fs/super.c:4174 f2fs_write_inode+0x35f/0x4d0 fs/f2fs/inode.c:785 write_inode fs/fs-writeback.c:1503 [inline] __writeback_single_inode+0x711/0x10d0 fs/fs-writeback.c:1723 writeback_single_inode+0x1f3/0x660 fs/fs-writeback.c:1779 sync_inode_metadata+0xc4/0x120 fs/fs-writeback.c:2849 f2fs_release_file+0xa8/0x100 fs/f2fs/file.c:1941 __fput+0x23f/0x880 fs/file_table.c:431 task_work_run+0x24f/0x310 kernel/task_work.c:228 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x168/0x370 kernel/entry/common.c:218 do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (&sbi->sb_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1384/0x2050 kernel/locking/lockdep.c:5202 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825 down_write+0x99/0x220 kernel/locking/rwsem.c:1577 f2fs_down_write fs/f2fs/f2fs.h:2199 [inline] f2fs_record_stop_reason+0x52/0x1d0 fs/f2fs/super.c:4068 f2fs_handle_critical_error+0x2ac/0x5c0 fs/f2fs/super.c:4174 f2fs_evict_inode+0xa61/0x15c0 fs/f2fs/inode.c:883 evict+0x4e8/0x9b0 fs/inode.c:725 f2fs_evict_inode+0x1a4/0x15c0 fs/f2fs/inode.c:807 evict+0x4e8/0x9b0 fs/inode.c:725 dispose_list fs/inode.c:774 [inline] prune_icache_sb+0x239/0x2f0 fs/inode.c:963 super_cache_scan+0x38c/0x4b0 fs/super.c:223 do_shrink_slab+0x701/0x1160 mm/shrinker.c:435 shrink_slab+0x1093/0x14d0 mm/shrinker.c:662 shrink_one+0x43b/0x850 mm/vmscan.c:4818 shrink_many mm/vmscan.c:4879 [inline] lru_gen_shrink_node mm/vmscan.c:4957 [inline] shrink_node+0x3799/0x3de0 mm/vmscan.c:5937 kswapd_shrink_node mm/vmscan.c:6765 [inline] balance_pgdat mm/vmscan.c:6957 [inline] kswapd+0x1ca3/0x3700 mm/vmscan.c:7226 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: &sbi->sb_lock --> fs_reclaim --> sb_internal#2 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(sb_internal#2); lock(fs_reclaim); lock(sb_internal#2); lock(&sbi->sb_lock); Root cause is there will be potential deadlock in between below tasks: Thread A Kswapd - f2fs_ioc_commit_atomic_write - mnt_want_write_file -- down_read lock A - balance_pgdat - __fs_reclaim_acquire -- lock B - shrink_node - prune_icache_sb - dispose_list - f2fs_evict_inode - sb_start_intwrite -- down_read lock A - f2fs_do_sync_file - f2fs_write_inode - f2fs_handle_critical_error - f2fs_record_stop_reason - f2fs_commit_super - read_mapping_folio - filemap_alloc_folio_noprof - fs_reclaim_acquire -- lock B Both threads try to acquire read lock of lock A, then its upcoming write lock grabber will trigger deadlock. Let's always create an asynchronous task in f2fs_handle_critical_error() rather than calling f2fs_record_stop_reason() synchronously to avoid this potential deadlock issue. Fixes: b62e71be2110 ("f2fs: support errors=remount-ro|continue|panic mountoption") Reported-by: syzbot+be4a9983e95a5e25c8d3@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6704d667.050a0220.1e4d62.0081.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Daejun Park <daejun7.park@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: Fix not used variable 'index'Zeng Heng
Fix the following compilation warning: fs/f2fs/data.c:2391:10: warning: variable ‘index’ set but not used [-Wunused-but-set-variable] 2391 | pgoff_t index; Only define and set the variable index when the CONFIG_F2FS_FS_COMPRESSION is enabled. Fixes: db92e6c729d8 ("f2fs: convert f2fs_mpage_readpages() to use folio") Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: fix typosAndrew Kreimer
Fix typos: datas -> data. Via codespell. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: modify f2fs_is_checkpoint_ready logic to allow more data to be written ↵Qi Han
with the CP disable When the free segment is used up during CP disable, many write or ioctl operations will get ENOSPC error codes, even if there are still many blocks available. We can reproduce it in the following steps: dd if=/dev/zero of=f2fs.img bs=1M count=65 mkfs.f2fs -f f2fs.img mount f2fs.img f2fs_dir -o checkpoint=disable:10% cd f2fs_dir i=1 ; while [[ $i -lt 50 ]] ; do (file_name=./2M_file$i ; dd \ if=/dev/random of=$file_name bs=1M count=2); i=$((i+1)); done sync i=1 ; while [[ $i -lt 50 ]] ; do (file_name=./2M_file$i ; truncate \ -s 1K $file_name); i=$((i+1)); done sync dd if=/dev/zero of=./file bs=1M count=20 In f2fs_need_SSR() function, it is allowed to use SSR to allocate blocks when CP is disabled, so in f2fs_is_checkpoint_ready function, can we judge the number of invalid blocks when free segment is not enough, and return ENOSPC only if the number of invalid blocks is also not enough. Signed-off-by: Qi Han <hanqi@vivo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: check curseg->inited before write_sum_page in change_cursegYongpeng Yang
In the __f2fs_init_atgc_curseg->get_atssr_segment calling, curseg->segno is NULL_SEGNO, indicating that there is no summary block that needs to be written. Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection") Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: fix the wrong f2fs_bug_on condition in f2fs_do_replace_blockLongPing Wei
This f2fs_bug_on was introduced by commit 2c1905042c8c ("f2fs: check segment type in __f2fs_replace_block") when there were only 6 curseg types. After commit d0b9e42ab615 ("f2fs: introduce inmem curseg") was introduced, the condition should be changed to checking curseg->seg_type. Fixes: d0b9e42ab615 ("f2fs: introduce inmem curseg") Signed-off-by: LongPing Wei <weilongping@oppo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: zone: introduce first_zoned_segno in f2fs_sb_infoChao Yu
first_zoned_segno() returns a fixed value, let's cache it in structure f2fs_sb_info to avoid redundant calculation. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: introduce device aliasing fileDaeho Jeong
F2FS should understand how the device aliasing file works and support deleting the file after use. A device aliasing file can be created by mkfs.f2fs tool and it can map the whole device with an extent, not using node blocks. The file space should be pinned and normally used for read-only usages. Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-01f2fs: decrease spare area for pinned files for zoned devicesDaeho Jeong
Now we reclaim too much space before allocating pinned space for zoned devices. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>