summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-09-05nfs/localio: restore creds before releasing pageio dataScott Mayhew
Otherwise if the nfsd filecache code releases the nfsd_file immediately, it can trigger the BUG_ON(cred == current->cred) in __put_cred() when it puts the nfsd_file->nf_file->f-cred. Fixes: b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250807164938.2395136-1-smayhew@redhat.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-09-05Merge tag '6.17-RC4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - Fix two potential NULL pointer references - Two debugging improvements (to help debug recent issues) a new tracepoint, and minor improvement to DebugData - Trivial comment cleanup * tag '6.17-RC4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: prevent NULL pointer dereference in UTF16 conversion smb: client: show negotiated cipher in DebugData smb: client: add new tracepoint to trace lease break notification smb: client: fix spellings in comments smb: client: Fix NULL pointer dereference in cifs_debug_dirs_proc_show()
2025-09-05btrfs: don't allow adding block device of less than 1 MBMark Harmstone
Commit 15ae0410c37a79 ("btrfs-progs: add error handling for device_get_partition_size_fd_stat()") in btrfs-progs inadvertently changed it so that if the BLKGETSIZE64 ioctl on a block device returned a size of 0, this was no longer seen as an error condition. Unfortunately this is how disconnected NBD devices behave, meaning that with btrfs-progs 6.16 it's now possible to add a device you can't remove: # btrfs device add /dev/nbd0 /root/temp # btrfs device remove /dev/nbd0 /root/temp ERROR: error removing device '/dev/nbd0': Invalid argument This check should always have been done kernel-side anyway, so add a check in btrfs_init_new_device() that the new device doesn't have a size less than BTRFS_DEVICE_RANGE_RESERVED (i.e. 1 MB). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-05fuse: virtio_fs: fix page fault for DAX page addressHaiyue Wang
The commit ced17ee32a99 ("Revert "virtio: reject shm region if length is zero"") exposes the following DAX page fault bug (this fix the failure that getting shm region alway returns false because of zero length): The commit 21aa65bf82a7 ("mm: remove callers of pfn_t functionality") handles the DAX physical page address incorrectly: the removed macro 'phys_to_pfn_t()' should be replaced with 'PHYS_PFN()'. [ 1.390321] BUG: unable to handle page fault for address: ffffd3fb40000008 [ 1.390875] #PF: supervisor read access in kernel mode [ 1.391257] #PF: error_code(0x0000) - not-present page [ 1.391509] PGD 0 P4D 0 [ 1.391626] Oops: Oops: 0000 [#1] SMP NOPTI [ 1.391806] CPU: 6 UID: 1000 PID: 162 Comm: weston Not tainted 6.17.0-rc3-WSL2-STABLE #2 PREEMPT(none) [ 1.392361] RIP: 0010:dax_to_folio+0x14/0x60 [ 1.392653] Code: 52 c9 c3 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 c1 ef 05 48 c1 e7 06 48 03 3d 34 b5 31 01 <48> 8b 57 08 48 89 f8 f6 c2 01 75 2b 66 90 c3 cc cc cc cc f7 c7 ff [ 1.393727] RSP: 0000:ffffaf7d04407aa8 EFLAGS: 00010086 [ 1.394003] RAX: 000000a000000000 RBX: ffffaf7d04407bb0 RCX: 0000000000000000 [ 1.394524] RDX: ffffd17b40000008 RSI: 0000000000000083 RDI: ffffd3fb40000000 [ 1.394967] RBP: 0000000000000011 R08: 000000a000000000 R09: 0000000000000000 [ 1.395400] R10: 0000000000001000 R11: ffffaf7d04407c10 R12: 0000000000000000 [ 1.395806] R13: ffffa020557be9c0 R14: 0000014000000001 R15: 0000725970e94000 [ 1.396268] FS: 000072596d6d2ec0(0000) GS:ffffa0222dc59000(0000) knlGS:0000000000000000 [ 1.396715] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.397100] CR2: ffffd3fb40000008 CR3: 000000011579c005 CR4: 0000000000372ef0 [ 1.397518] Call Trace: [ 1.397663] <TASK> [ 1.397900] dax_insert_entry+0x13b/0x390 [ 1.398179] dax_fault_iter+0x2a5/0x6c0 [ 1.398443] dax_iomap_pte_fault+0x193/0x3c0 [ 1.398750] __fuse_dax_fault+0x8b/0x270 [ 1.398997] ? vm_mmap_pgoff+0x161/0x210 [ 1.399175] __do_fault+0x30/0x180 [ 1.399360] do_fault+0xc4/0x550 [ 1.399547] __handle_mm_fault+0x8e3/0xf50 [ 1.399731] ? do_syscall_64+0x72/0x1e0 [ 1.399958] handle_mm_fault+0x192/0x2f0 [ 1.400204] do_user_addr_fault+0x20e/0x700 [ 1.400418] exc_page_fault+0x66/0x150 [ 1.400602] asm_exc_page_fault+0x26/0x30 [ 1.400831] RIP: 0033:0x72596d1bf703 [ 1.401076] Code: 31 f6 45 31 e4 48 8d 15 b3 73 00 00 e8 06 03 00 00 8b 83 68 01 00 00 e9 8e fa ff ff 0f 1f 00 48 8b 44 24 08 4c 89 ee 48 89 df <c7> 00 21 43 34 12 e8 72 09 00 00 e9 6a fa ff ff 0f 1f 44 00 00 e8 [ 1.402172] RSP: 002b:00007ffc350f6dc0 EFLAGS: 00010202 [ 1.402488] RAX: 0000725970e94000 RBX: 00005b7c642c2560 RCX: 0000725970d359a7 [ 1.402898] RDX: 0000000000000003 RSI: 00007ffc350f6dc0 RDI: 00005b7c642c2560 [ 1.403284] RBP: 00007ffc350f6e90 R08: 000000000000000d R09: 0000000000000000 [ 1.403634] R10: 00007ffc350f6dd8 R11: 0000000000000246 R12: 0000000000000001 [ 1.404078] R13: 00007ffc350f6dc0 R14: 0000725970e29ce0 R15: 0000000000000003 [ 1.404450] </TASK> [ 1.404570] Modules linked in: [ 1.404821] CR2: ffffd3fb40000008 [ 1.405029] ---[ end trace 0000000000000000 ]--- [ 1.405323] RIP: 0010:dax_to_folio+0x14/0x60 [ 1.405556] Code: 52 c9 c3 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 c1 ef 05 48 c1 e7 06 48 03 3d 34 b5 31 01 <48> 8b 57 08 48 89 f8 f6 c2 01 75 2b 66 90 c3 cc cc cc cc f7 c7 ff [ 1.406639] RSP: 0000:ffffaf7d04407aa8 EFLAGS: 00010086 [ 1.406910] RAX: 000000a000000000 RBX: ffffaf7d04407bb0 RCX: 0000000000000000 [ 1.407379] RDX: ffffd17b40000008 RSI: 0000000000000083 RDI: ffffd3fb40000000 [ 1.407800] RBP: 0000000000000011 R08: 000000a000000000 R09: 0000000000000000 [ 1.408246] R10: 0000000000001000 R11: ffffaf7d04407c10 R12: 0000000000000000 [ 1.408666] R13: ffffa020557be9c0 R14: 0000014000000001 R15: 0000725970e94000 [ 1.409170] FS: 000072596d6d2ec0(0000) GS:ffffa0222dc59000(0000) knlGS:0000000000000000 [ 1.409608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.409977] CR2: ffffd3fb40000008 CR3: 000000011579c005 CR4: 0000000000372ef0 [ 1.410437] Kernel panic - not syncing: Fatal exception [ 1.410857] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: 21aa65bf82a7 ("mm: remove callers of pfn_t functionality") Signed-off-by: Haiyue Wang <haiyuewa@163.com> Link: https://lore.kernel.org/20250904120339.972-1-haiyuewa@163.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-04cifs: prevent NULL pointer dereference in UTF16 conversionMakar Semyonov
There can be a NULL pointer dereference bug here. NULL is passed to __cifs_sfu_make_node without checks, which passes it unchecked to cifs_strndup_to_utf16, which in turn passes it to cifs_local_to_utf16_bytes where '*from' is dereferenced, causing a crash. This patch adds a check for NULL 'src' in cifs_strndup_to_utf16 and returns NULL early to prevent dereferencing NULL pointer. Found by Linux Verification Center (linuxtesting.org) with SVACE Signed-off-by: Makar Semyonov <m.semenov@tssltd.ru> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-03Merge tag 'v6.17-rc4-ksmbd-fix' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fix from Steve French: - fix handling filenames with ":" (colon) in them * tag 'v6.17-rc4-ksmbd-fix' of git://git.samba.org/ksmbd: ksmbd: allow a filename to contain colons on SMB3.1.1 posix extensions
2025-09-02smb: client: show negotiated cipher in DebugDataBharath SM
Print the negotiated encryption cipher type in DebugData Signed-off-by: Bharath SM <bharathsm@microsoft.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-02smb: client: add new tracepoint to trace lease break notificationBharath SM
Add smb3_lease_break_enter to trace lease break notifications, recording lease state, flags, epoch, and lease key. Align smb3_lease_not_found to use the same payload and print format. Signed-off-by: Bharath SM <bharathsm@microsoft.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-02smb: client: fix spellings in commentsBharath SM
correct spellings in comments Signed-off-by: Bharath SM <bharathsm@microsoft.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-02Merge tag 'mm-hotfixes-stable-2025-09-01-17-20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. 13 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 11 of these fixes are for MM. This includes a three-patch series from Harry Yoo which fixes an intermittent boot failure which can occur on x86 systems. And a two-patch series from Alexander Gordeev which fixes a KASAN crash on S390 systems" * tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: fix possible deadlock in kmemleak x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() mm: introduce and use {pgd,p4d}_populate_kernel() mm: move page table sync declarations to linux/pgtable.h proc: fix missing pde_set_flags() for net proc files mm: fix accounting of memmap pages mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota() kexec: add KEXEC_FILE_NO_CMA as a legal flag kasan: fix GCC mem-intrinsic prefix with sw tags mm/kasan: avoid lazy MMU mode hazards mm/kasan: fix vmalloc shadow memory (de-)population races kunit: kasan_test: disable fortify string checker on kasan_strings() test selftests/mm: fix FORCE_READ to read input value correctly mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE ocfs2: prevent release journal inode after journal shutdown rust: mm: mark VmaNew as transparent of_numa: fix uninitialized memory nodes causing kernel panic
2025-09-02Merge tag 'for-6.17-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix a few races related to inode link count - fix inode leak on failure to add link to inode - move transaction aborts closer to where they happen * tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: avoid load/store tearing races when checking if an inode was logged btrfs: fix race between setting last_dir_index_offset and inode logging btrfs: fix race between logging inode and checking if it was logged before btrfs: simplify error handling logic for btrfs_link() btrfs: fix inode leak on failure to add link to inode btrfs: abort transaction on failure to add link to inode
2025-09-02btrfs: fix subvolume deletion lockup caused by inodes xarray raceOmar Sandoval
There is a race condition between inode eviction and inode caching that can cause a live struct btrfs_inode to be missing from the root->inodes xarray. Specifically, there is a window during evict() between the inode being unhashed and deleted from the xarray. If btrfs_iget() is called for the same inode in that window, it will be recreated and inserted into the xarray, but then eviction will delete the new entry, leaving nothing in the xarray: Thread 1 Thread 2 --------------------------------------------------------------- evict() remove_inode_hash() btrfs_iget_path() btrfs_iget_locked() btrfs_read_locked_inode() btrfs_add_inode_to_root() destroy_inode() btrfs_destroy_inode() btrfs_del_inode_from_root() __xa_erase In turn, this can cause issues for subvolume deletion. Specifically, if an inode is in this lost state, and all other inodes are evicted, then btrfs_del_inode_from_root() will call btrfs_add_dead_root() prematurely. If the lost inode has a delayed_node attached to it, then when btrfs_clean_one_deleted_snapshot() calls btrfs_kill_all_delayed_nodes(), it will loop forever because the delayed_nodes xarray will never become empty (unless memory pressure forces the inode out). We saw this manifest as soft lockups in production. Fix it by only deleting the xarray entry if it matches the given inode (using __xa_cmpxchg()). Fixes: 310b2f5d5a94 ("btrfs: use an xarray to track open inodes in a root") Cc: stable@vger.kernel.org # 6.11+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Co-authored-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-02btrfs: fix corruption reading compressed range when block size is smaller ↵Qu Wenruo
than page size [BUG] With 64K page size (aarch64 with 64K page size config) and 4K btrfs block size, the following workload can easily lead to a corrupted read: mkfs.btrfs -f -s 4k $dev > /dev/null mount -o compress $dev $mnt xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null echo "correct result:" od -Ad -t x1 $mnt/base xfs_io -f -c "reflink $mnt/base 32k 0 32k" \ -c "reflink $mnt/base 0 32k 32k" \ -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null echo "incorrect result:" od -Ad -t x1 $mnt/new umount $mnt This shows the following result: correct result: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0065536 incorrect result: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0065536 Notice the zero in the range [32K, 60K), which is incorrect. [CAUSE] With extra trace printk, it shows the following events during od: (some unrelated info removed like CPU and context) od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000 The "r/i" is indicating the root and inode number. In our case the file "new" is using ino 258 from fs tree (root 5). Here notice the @prev_em_start pointer is NULL. This means the btrfs_do_readpage() is called from btrfs_read_folio(), not from btrfs_readahead(). od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768 These above 32K blocks will be read from the first half of the compressed data extent. od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768 Note here there is no btrfs_submit_compressed_read() call. Which is incorrect now. Although both extent maps at 0 and 32K are pointing to the same compressed data, their offsets are different thus can not be merged into the same read. So this means the compressed data read merge check is doing something wrong. od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440 The function btrfs_submit_compressed_read() is only called at the end of folio read. The compressed bio will only have an extent map of range [0, 32K), but the original bio passed in is for the whole 64K folio. This will cause the decompression part to only fill the first 32K, leaving the rest untouched (aka, filled with zero). This incorrect compressed read merge leads to the above data corruption. There were similar problems that happened in the past, commit 808f80b46790 ("Btrfs: update fix for read corruption of compressed and shared extents") is doing pretty much the same fix for readahead. But that's back to 2015, where btrfs still only supports bs (block size) == ps (page size) cases. This means btrfs_do_readpage() only needs to handle a folio which contains exactly one block. Only btrfs_readahead() can lead to a read covering multiple blocks. Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer. With v5.15 kernel btrfs introduced bs < ps support. This breaks the above assumption that a folio can only contain one block. Now btrfs_read_folio() can also read multiple blocks in one go. But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the existing bio force submission check will never be triggered. In theory, this can also happen for btrfs with large folios, but since large folio is still experimental, we don't need to bother it, thus only bs < ps support is affected for now. [FIX] Instead of passing @prev_em_start to do the proper compressed extent check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that the existing bio force submission logic will always be triggered. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-02btrfs: accept and ignore compression level for lzoCalvin Owens
The compression level is meaningless for lzo, but before commit 3f093ccb95f30 ("btrfs: harden parsing of compression mount options"), it was silently ignored if passed. After that commit, passing a level with lzo fails to mount: BTRFS error: unrecognized compression value lzo:1 It seems reasonable for users to expect that lzo would permit a numeric level option, as all the other algos do, even though the kernel's implementation of LZO currently only supports a single level. Because it has always worked to pass a level, it seems likely to me that users in the real world are relying on doing so. This patch restores the old behavior, giving "lzo:N" the same semantics as all of the other compression algos. To be clear, silly variants like "lzo:one", "lzo:the_first_option", or "lzo:armageddon" also used to work. This isn't meant to suggest that any possible mis-interpretation of mount options that once worked must continue to work forever. This is an exceptional case where it makes sense to preserve compatibility, both because the mis-interpretation is reasonable, and because nothing tangible is sacrificed. Finally update btrfs_show_options() to ignore the level of LZO, as it is only the default level without any extra meaning. Fixes: 3f093ccb95f30 ("btrfs: harden parsing of compression mount options") Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Calvin Owens <calvin@wbinvd.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-02btrfs: fix squota compressed stats leakBoris Burkov
The following workload on a squota enabled fs: btrfs subvol create mnt/subvol # ensure subvol extents get accounted sync btrfs qgroup create 1/1 mnt btrfs qgroup assign mnt/subvol 1/1 mnt btrfs qgroup delete mnt/subvol # make the cleaner thread run btrfs filesystem sync mnt sleep 1 btrfs filesystem sync mnt btrfs qgroup destroy 1/1 mnt will fail with EBUSY. The reason is that 1/1 does the quick accounting when we assign subvol to it, gaining its exclusive usage as excl and excl_cmpr. But then when we delete subvol, the decrement happens via record_squota_delta() which does not update excl_cmpr, as squotas does not make any distinction between compressed and normal extents. Thus, we increment excl_cmpr but never decrement it, and are unable to delete 1/1. The two possible fixes are to make squota always mirror excl and excl_cmpr or to make the fast accounting separately track the plain and cmpr numbers. The latter felt cleaner to me so that is what I opted for. Fixes: 1e0e9d5771c3 ("btrfs: add helper for recording simple quota deltas") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-01Merge tag 'fuse-fixes-6.17-rc5' of ↵Christian Brauner
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into vfs.fixes fuse fixes for 6.17-rc5 * tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits) fuse: Block access to folio overlimit fuse: fix fuseblk i_blkbits for iomap partial writes fuse: reflect cached blocksize if blocksize was changed fuse: prevent overflow in copy_file_range return value fuse: check if copy_file_range() returns larger than requested size fuse: do not allow mapping a non-regular backing file Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-31ksmbd: allow a filename to contain colons on SMB3.1.1 posix extensionsPhilipp Kerling
If the client sends SMB2_CREATE_POSIX_CONTEXT to ksmbd, allow the filename to contain a colon (':'). This requires disabling the support for Alternate Data Streams (ADS), which are denoted by a colon-separated suffix to the filename on Windows. This should not be an issue, since this concept is not known to POSIX anyway and the client has to explicitly request a POSIX context to get this behavior. Link: https://lore.kernel.org/all/f9401718e2be2ab22058b45a6817db912784ef61.camel@rx2.rx-server.de/ Signed-off-by: Philipp Kerling <pkerling@casix.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-08-31smb: client: Fix NULL pointer dereference in cifs_debug_dirs_proc_show()Wang Zhaolong
Reading /proc/fs/cifs/open_dirs may hit a NULL dereference when tcon->cfids is NULL. Add NULL check before accessing cfids to prevent the crash. Reproduction: - Mount CIFS share - cat /proc/fs/cifs/open_dirs Fixes: 844e5c0eb176 ("smb3 client: add way to show directory leases for improved debugging") Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-08-29NFSv4: Clear the NFS_CAP_XATTR flag if not supported by the serverTrond Myklebust
nfs_server_set_fsinfo() shouldn't assume that NFS_CAP_XATTR is unset on entry to the function. Fixes: b78ef845c35d ("NFSv4.2: query the server for extended attribute support") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-08-29NFSv4: Clear NFS_CAP_OPEN_XOR and NFS_CAP_DELEGTIME if not supportedTrond Myklebust
_nfs4_server_capabilities() should clear capabilities that are not supported by the server. Fixes: d2a00cceb93a ("NFSv4: Detect support for OPEN4_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-08-29NFSv4: Clear the NFS_CAP_FS_LOCATIONS flag if it is not setTrond Myklebust
_nfs4_server_capabilities() is expected to clear any flags that are not supported by the server. Fixes: 8a59bb93b7e3 ("NFSv4 store server support for fs_location attribute") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-08-29NFSv4: Don't clear capabilities that won't be resetTrond Myklebust
Don't clear the capabilities that are not going to get reset by the call to _nfs4_server_capabilities(). Reported-by: Scott Haiden <scott.b.haiden@gmail.com> Fixes: b01f21cacde9 ("NFS: Fix the setting of capabilities when automounting a new filesystem") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-08-29Merge tag 'efi-fixes-for-v6.17-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Assorted fixes for the OP-TEE based pseudo-EFI variable store - Fix for an OOB access when looking up the same non-existing efivarfs entry multiple times in parallel * tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare efi: stmm: Drop unneeded null pointer check efi: stmm: Drop unused EFI error from setup_mm_hdr arguments efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors efi: stmm: Fix incorrect buffer allocation method
2025-08-29Merge tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - Fix possible refcount leak in compound operations - Fix remap_file_range() return code mapping, found by generic/157 * tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: fs/smb: Fix inconsistent refcnt update smb3 client: fix return code mapping of remap_file_range
2025-08-29Merge tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Carlos Maiolino: "The highlight I'd like to point here is related to the XFS_RT Kconfig, which has been updated to be enabled by default now if CONFIG_BLK_DEV_ZONED is enabled. This also contains a few fixes for zoned devices support in XFS, specially related to swapon requests in inodes belonging to the zoned FS. A null-ptr dereference fix in the xattr data, due to a mishandling of medium errors generated by block devices is also included" * tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: do not propagate ENODATA disk errors into xattr code xfs: reject swapon for inodes on a zoned file system earlier xfs: kick off inodegc when failing to reserve zoned blocks xfs: remove xfs_last_used_zone xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
2025-08-29fhandle: use more consistent rules for decoding file handle from usernsAmir Goldstein
Commit 620c266f39493 ("fhandle: relax open_by_handle_at() permission checks") relaxed the coditions for decoding a file handle from non init userns. The conditions are that that decoded dentry is accessible from the user provided mountfd (or to fs root) and that all the ancestors along the path have a valid id mapping in the userns. These conditions are intentionally more strict than the condition that the decoded dentry should be "lookable" by path from the mountfd. For example, the path /home/amir/dir/subdir is lookable by path from unpriv userns of user amir, because /home perms is 755, but the owner of /home does not have a valid id mapping in unpriv userns of user amir. The current code did not check that the decoded dentry itself has a valid id mapping in the userns. There is no security risk in that, because that final open still performs the needed permission checks, but this is inconsistent with the checks performed on the ancestors, so the behavior can be a bit confusing. Add the check for the decoded dentry itself, so that the entire path, including the last component has a valid id mapping in the userns. Fixes: 620c266f39493 ("fhandle: relax open_by_handle_at() permission checks") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250827194309.1259650-1-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-28efivarfs: Fix slab-out-of-bounds in efivarfs_d_compareLi Nan
Observed on kernel 6.6 (present on master as well): BUG: KASAN: slab-out-of-bounds in memcmp+0x98/0xd0 Call trace: kasan_check_range+0xe8/0x190 __asan_loadN+0x1c/0x28 memcmp+0x98/0xd0 efivarfs_d_compare+0x68/0xd8 __d_lookup_rcu_op_compare+0x178/0x218 __d_lookup_rcu+0x1f8/0x228 d_alloc_parallel+0x150/0x648 lookup_open.isra.0+0x5f0/0x8d0 open_last_lookups+0x264/0x828 path_openat+0x130/0x3f8 do_filp_open+0x114/0x248 do_sys_openat2+0x340/0x3c0 __arm64_sys_openat+0x120/0x1a0 If dentry->d_name.len < EFI_VARIABLE_GUID_LEN , 'guid' can become negative, leadings to oob. The issue can be triggered by parallel lookups using invalid filename: T1 T2 lookup_open ->lookup simple_lookup d_add // invalid dentry is added to hash list lookup_open d_alloc_parallel __d_lookup_rcu __d_lookup_rcu_op_compare hlist_bl_for_each_entry_rcu // invalid dentry can be retrieved ->d_compare efivarfs_d_compare // oob Fix it by checking 'guid' before cmp. Fixes: da27a24383b2 ("efivarfs: guid part of filenames are case-insensitive") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-08-27proc: fix missing pde_set_flags() for net proc fileswangzijie
To avoid potential UAF issues during module removal races, we use pde_set_flags() to save proc_ops flags in PDE itself before proc_register(), and then use pde_has_proc_*() helpers instead of directly dereferencing pde->proc_ops->*. However, the pde_set_flags() call was missing when creating net related proc files. This omission caused incorrect behavior which FMODE_LSEEK was being cleared inappropriately in proc_reg_open() for net proc files. Lars reported it in this link[1]. Fix this by ensuring pde_set_flags() is called when register proc entry, and add NULL check for proc_ops in pde_set_flags(). [wangzijie1@honor.com: stash pde->proc_ops in a local const variable, per Christian] Link: https://lkml.kernel.org/r/20250821105806.1453833-1-wangzijie1@honor.com Link: https://lkml.kernel.org/r/20250818123102.959595-1-wangzijie1@honor.com Link: https://lore.kernel.org/all/20250815195616.64497967@chagall.paradoxon.rec/ [1] Fixes: ff7ec8dc1b64 ("proc: use the same treatment to check proc_lseek as ones for proc_read_iter et.al") Signed-off-by: wangzijie <wangzijie1@honor.com> Reported-by: Lars Wendler <polynomial-c@gmx.de> Tested-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Petr Vaněk <pv@excello.cz> Tested by: Lars Wendler <polynomial-c@gmx.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: wangzijie <wangzijie1@honor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-27ocfs2: prevent release journal inode after journal shutdownEdward Adam Davis
Before calling ocfs2_delete_osb(), ocfs2_journal_shutdown() has already been executed in ocfs2_dismount_volume(), so osb->journal must be NULL. Therefore, the following calltrace will inevitably fail when it reaches jbd2_journal_release_jbd_inode(). ocfs2_dismount_volume()-> ocfs2_delete_osb()-> ocfs2_free_slot_info()-> __ocfs2_free_slot_info()-> evict()-> ocfs2_evict_inode()-> ocfs2_clear_inode()-> jbd2_journal_release_jbd_inode(osb->journal->j_journal, Adding osb->journal checks will prevent null-ptr-deref during the above execution path. Link: https://lkml.kernel.org/r/tencent_357489BEAEE4AED74CBD67D246DBD2C4C606@qq.com Fixes: da5e7c87827e ("ocfs2: cleanup journal init and shutdown") Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reported-by: syzbot+47d8cb2f2cc1517e515a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=47d8cb2f2cc1517e515a Tested-by: syzbot+47d8cb2f2cc1517e515a@syzkaller.appspotmail.com Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-28erofs: fix invalid algorithm for encoded extentsGao Xiang
The current algorithm sanity checks do not properly apply to new encoded extents. Unify the algorithm check with Z_EROFS_COMPRESSION(_RUNTIME)_MAX and ensure consistency with sbi->available_compr_algs. Reported-and-tested-by: syzbot+5a398eb460ddaa6f242f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/68a8bd20.050a0220.37038e.005a.GAE@google.com Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Thanks-to: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-08-27fs/smb: Fix inconsistent refcnt updateShuhao Fu
A possible inconsistent update of refcount was identified in `smb2_compound_op`. Such inconsistent update could lead to possible resource leaks. Why it is a possible bug: 1. In the comment section of the function, it clearly states that the reference to `cfile` should be dropped after calling this function. 2. Every control flow path would check and drop the reference to `cfile`, except the patched one. 3. Existing callers would not handle refcount update of `cfile` if -ENOMEM is returned. To fix the bug, an extra goto label "out" is added, to make sure that the cleanup logic would always be respected. As the problem is caused by the allocation failure of `vars`, the cleanup logic between label "finished" and "out" can be safely ignored. According to the definition of function `is_replayable_error`, the error code of "-ENOMEM" is not recoverable. Therefore, the replay logic also gets ignored. Signed-off-by: Shuhao Fu <sfual@cse.ust.hk> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-08-27fuse: Block access to folio overlimitEdward Adam Davis
syz reported a slab-out-of-bounds Write in fuse_dev_do_write. When the number of bytes to be retrieved is truncated to the upper limit by fc->max_pages and there is an offset, the oob is triggered. Add a loop termination condition to prevent overruns. Fixes: 3568a9569326 ("fuse: support large folios for retrieves") Reported-by: syzbot+2d215d165f9354b9c4ea@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2d215d165f9354b9c4ea Tested-by: syzbot+2d215d165f9354b9c4ea@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: fix fuseblk i_blkbits for iomap partial writesJoanne Koong
On regular fuse filesystems, i_blkbits is set to PAGE_SHIFT which means any iomap partial writes will mark the entire folio as uptodate. However fuseblk filesystems work differently and allow the blocksize to be less than the page size. As such, this may lead to data corruption if fuseblk sets its blocksize to less than the page size, uses the writeback cache, and does a partial write, then a read and the read happens before the write has undergone writeback, since the folio will not be marked uptodate from the partial write so the read will read in the entire folio from disk, which will overwrite the partial write. The long-term solution for this, which will also be needed for fuse to enable large folios with the writeback cache on, is to have fuse also use iomap for folio reads, but until that is done, the cleanest workaround is to use the page size for fuseblk's internal kernel inode blksize/blkbits values while maintaining current behavior for stat(). This was verified using ntfs-3g: $ sudo mkfs.ntfs -f -c 512 /dev/vdd1 $ sudo ntfs-3g /dev/vdd1 ~/fuseblk $ stat ~/fuseblk/hi.txt IO Block: 512 Fixes: a4c9ab1d4975 ("fuse: use iomap for buffered writes") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: reflect cached blocksize if blocksize was changedJoanne Koong
As pointed out by Miklos[1], in the fuse_update_get_attr() path, the attributes returned to stat may be cached values instead of fresh ones fetched from the server. In the case where the server returned a modified blocksize value, we need to cache it and reflect it back to stat if values are not re-fetched since we now no longer directly change inode->i_blkbits. Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguCOxeVX88_zPd1hqziB_C+tmfuDhZP5qO2nKmnb-dTUA@mail.gmail.com/ [1] Fixes: 542ede096e48 ("fuse: keep inode->i_blkbits constant") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: prevent overflow in copy_file_range return valueMiklos Szeredi
The FUSE protocol uses struct fuse_write_out to convey the return value of copy_file_range, which is restricted to uint32_t. But the COPY_FILE_RANGE interface supports a 64-bit size copies. Currently the number of bytes copied is silently truncated to 32-bit, which may result in poor performance or even failure to copy in case of truncation to zero. Reported-by: Florian Weimer <fweimer@redhat.com> Closes: https://lore.kernel.org/all/lhuh5ynl8z5.fsf@oldenburg.str.redhat.com/ Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: <stable@vger.kernel.org> # v4.20 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: check if copy_file_range() returns larger than requested sizeMiklos Szeredi
Just like write(), copy_file_range() should check if the return value is less or equal to the requested number of bytes. Reported-by: Chunsheng Luo <luochunsheng@ustc.edu> Closes: https://lore.kernel.org/all/20250807062425.694-1-luochunsheng@ustc.edu/ Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: <stable@vger.kernel.org> # v4.20 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: do not allow mapping a non-regular backing fileAmir Goldstein
We do not support passthrough operations other than read/write on regular file, so allowing non-regular backing files makes no sense. Fixes: efad7153bf93 ("fuse: allow O_PATH fd for FUSE_DEV_IOC_BACKING_OPEN") Cc: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26xfs: do not propagate ENODATA disk errors into xattr codeEric Sandeen
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code; namely, that the requested attribute name could not be found. However, a medium error from disk may also return ENODATA. At best, this medium error may escape to userspace as "attribute not found" when in fact it's an IO (disk) error. At worst, we may oops in xfs_attr_leaf_get() when we do: error = xfs_attr_leaf_hasname(args, &bp); if (error == -ENOATTR) { xfs_trans_brelse(args->trans, bp); return error; } because an ENODATA/ENOATTR error from disk leaves us with a null bp, and the xfs_trans_brelse will then null-deref it. As discussed on the list, we really need to modify the lower level IO functions to trap all disk errors and ensure that we don't let unique errors like this leak up into higher xfs functions - many like this should be remapped to EIO. However, this patch directly addresses a reported bug in the xattr code, and should be safe to backport to stable kernels. A larger-scope patch to handle more unique errors at lower levels can follow later. (Note, prior to 07120f1abdff we did not oops, but we did return the wrong error code to userspace.) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Cc: stable@vger.kernel.org # v5.9+ Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-08-25smb3 client: fix return code mapping of remap_file_rangeSteve French
We were returning -EOPNOTSUPP for various remap_file_range cases but for some of these the copy_file_range_syscall() requires -EINVAL to be returned (e.g. where source and target file ranges overlap when source and target are the same file). This fixes xfstest generic/157 which was expecting EINVAL for that (and also e.g. for when the src offset is beyond end of file). Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-08-23Merge tag 'driver-core-6.17-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Fix swapped handling of lru_gen and lru_gen_full debugfs files in vmscan - Fix debugfs mount options (uid, gid, mode) being silently ignored - Fix leak of devres action in the unwind path of Devres::new() - Documentation: - Expand and fix documentation of (outdated) Device, DeviceContext and generic driver infrastructure - Fix C header link of faux device abstractions - Clarify expected interaction with the security team - Smooth text flow in the security bug reporting process documentation * tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: Documentation: smooth the text flow in the security bug reporting process Documentation: clarify the expected collaboration with security bugs reporters debugfs: fix mount options not being applied rust: devres: fix leaking call to devm_add_action() rust: faux: fix C header link driver: rust: expand documentation for driver infrastructure device: rust: expand documentation for Device device: rust: expand documentation for DeviceContext mm/vmscan: fix inverted polarity in lru_gen_seq_show()
2025-08-22Merge tag '6.17-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fix from Steve French: "Fix for netfs smb3 oops" * tag '6.17-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix oops due to uninitialised variable
2025-08-22Merge tag 'nfs-for-6.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fix from Trond Myklebust: - NFS: Fix a data corrupting race when updating an existing write * tag 'nfs-for-6.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a race when updating an existing write
2025-08-22Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 10 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 17 of these fixes are for MM. As usual, singletons all over the place, apart from a three-patch series of KHO followup work from Pasha which is actually also a bunch of singletons" * tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/mremap: fix WARN with uffd that has remap events disabled mm/damon/sysfs-schemes: put damos dests dir after removing its files mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m mm/damon/core: fix damos_commit_filter not changing allow mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn MAINTAINERS: mark MGLRU as maintained mm: rust: add page.rs to MEMORY MANAGEMENT - RUST iov_iter: iterate_folioq: fix handling of offset >= folio size selftests/damon: fix selftests by installing drgn related script .mailmap: add entry for Easwar Hariharan selftests/mm: add test for invalid multi VMA operations mm/mremap: catch invalid multi VMA moves earlier mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area mm/damon/core: fix commit_ops_filters by using correct nth function tools/testing: add linux/args.h header and fix radix, VMA tests mm/debug_vm_pgtable: clear page table entries at destroy_args() squashfs: fix memory leak in squashfs_fill_super kho: warn if KHO is disabled due to an error kho: mm: don't allow deferred struct page with KHO kho: init new_physxa->phys_bits to fix lockdep
2025-08-22btrfs: avoid load/store tearing races when checking if an inode was loggedFilipe Manana
At inode_logged() we do a couple lockless checks for ->logged_trans, and these are generally safe except the second one in case we get a load or store tearing due to a concurrent call updating ->logged_trans (either at btrfs_log_inode() or later at inode_logged()). In the first case it's safe to compare to the current transaction ID since once ->logged_trans is set the current transaction, we never set it to a lower value. In the second case, where we check if it's greater than zero, we are prone to load/store tearing races, since we can have a concurrent task updating to the current transaction ID with store tearing for example, instead of updating with a single 64 bits write, to update with two 32 bits writes or four 16 bits writes. In that case the reading side at inode_logged() could see a positive value that does not match the current transaction and then return a false negative. Fix this by doing the second check while holding the inode's spinlock, add some comments about it too. Also add the data_race() annotation to the first check to avoid any reports from KCSAN (or similar tools) and comment about it. Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: fix race between setting last_dir_index_offset and inode loggingFilipe Manana
At inode_logged() if we find that the inode was not logged before we update its ->last_dir_index_offset to (u64)-1 with the goal that the next directory log operation will see the (u64)-1 and then figure out it must check what was the index of the last logged dir index key and update ->last_dir_index_offset to that key's offset (this is done in update_last_dir_index_offset()). This however has a possibility for a time window where a race can happen and lead to directory logging skipping dir index keys that should be logged. The race happens like this: 1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks that the inode item was logged before, but before it sets the inode's ->last_dir_index_offset to (u64)-1... 2) Task B is at btrfs_log_inode() which calls inode_logged() early, and that has set ->last_dir_index_offset to (u64)-1; 3) Task B then enters log_directory_changes() which calls update_last_dir_index_offset(). There it sees ->last_dir_index_offset is (u64)-1 and that the inode was logged before (ctx->logged_before is true), and so it searches for the last logged dir index key in the log tree and it finds that it has an offset (index) value of N, so it sets ->last_dir_index_offset to N, so that we can skip index keys that are less than or equal to N (later at process_dir_items_leaf()); 4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update that task B just did; 5) Task B will now skip every index key when it enters process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1. Fix this by making inode_logged() not touch ->last_dir_index_offset and initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and then having update_last_dir_index_offset() treat a value of 0 as meaning we must check the log tree and update with the index of the last logged index key. This is fine since the minimum possible value for ->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1). This also simplifies the management of ->last_dir_index_offset and now all accesses to it are done under the inode's log_mutex. Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: fix race between logging inode and checking if it was logged beforeFilipe Manana
There's a race between checking if an inode was logged before and logging an inode that can cause us to mark an inode as not logged just after it was logged by a concurrent task: 1) We have inode X which was not logged before neither in the current transaction not in past transaction since the inode was loaded into memory, so it's ->logged_trans value is 0; 2) We are at transaction N; 3) Task A calls inode_logged() against inode X, sees that ->logged_trans is 0 and there is a log tree and so it proceeds to search in the log tree for an inode item for inode X. It doesn't see any, but before it sets ->logged_trans to N - 1... 3) Task B calls btrfs_log_inode() against inode X, logs the inode and sets ->logged_trans to N; 4) Task A now sets ->logged_trans to N - 1; 5) At this point anyone calling inode_logged() gets 0 (inode not logged) since ->logged_trans is greater than 0 and less than N, but our inode was really logged. As a consequence operations like rename, unlink and link that happen afterwards in the current transaction end up not updating the log when they should. Fix this by ensuring inode_logged() only updates ->logged_trans in case the inode item is not found in the log tree if after tacking the inode's lock (spinlock struct btrfs_inode::lock) the ->logged_trans value is still zero, since the inode lock is what protects setting ->logged_trans at btrfs_log_inode(). Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: simplify error handling logic for btrfs_link()Filipe Manana
Instead of incrementing the inode's link count and refcount early before adding the link, updating the inode and deleting orphan item, do it after all those steps succeeded right before calling d_instantiate(). This makes the error handling logic simpler by avoiding the need for the 'drop_inode' variable to signal if we need to undo the link count increment and the inode refcount increase under the 'fail' label. This also reduces the level of indentation by one, making the code easier to read. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: fix inode leak on failure to add link to inodeFilipe Manana
If we fail to update the inode or delete the orphan item we leak the inode since we update its refcount with the ihold() call to account for the d_instantiate() call which never happens in case we fail those steps. Fix this by setting 'drop_inode' to true in case we fail those steps. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: abort transaction on failure to add link to inodeFilipe Manana
If we fail to update the inode or delete the orphan item, we must abort the transaction to prevent persisting an inconsistent state. For example if we fail to update the inode item, we have the inconsistency of having a persisted inode item with a link count of N but we have N + 1 inode ref items and N + 1 directory entries pointing to our inode in case the transaction gets committed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-21coredump: don't pointlessly check and spew warningsChristian Brauner
When a write happens it doesn't make sense to check perform checks on the input. Skip them. Whether a fixes tag is licensed is a bit of a gray area here but I'll add one for the socket validation part I added recently. Link: https://lore.kernel.org/20250821-moosbedeckt-denunziant-7908663f3563@brauner Fixes: 16195d2c7dd2 ("coredump: validate socket name as it is written") Reported-by: Brad Spengler <brad.spengler@opensrcsec.com> Signed-off-by: Christian Brauner <brauner@kernel.org>