summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-07-25ext4: add ext4_try_lock_group() to skip busy groupsBaokun Li
When ext4 allocates blocks, we used to just go through the block groups one by one to find a good one. But when there are tons of block groups (like hundreds of thousands or even millions) and not many have free space (meaning they're mostly full), it takes a really long time to check them all, and performance gets bad. So, we added the "mb_optimize_scan" mount option (which is on by default now). It keeps track of some group lists, so when we need a free block, we can just grab a likely group from the right list. This saves time and makes block allocation much faster. But when multiple processes or containers are doing similar things, like constantly allocating 8k blocks, they all try to use the same block group in the same list. Even just two processes doing this can cut the IOPS in half. For example, one container might do 300,000 IOPS, but if you run two at the same time, the total is only 150,000. Since we can already look at block groups in a non-linear way, the first and last groups in the same list are basically the same for finding a block right now. Therefore, add an ext4_try_lock_group() helper function to skip the current group when it is locked by another process, thereby avoiding contention with other processes. This helps ext4 make better use of having multiple block groups. Also, to make sure we don't skip all the groups that have free space when allocating blocks, we won't try to skip busy groups anymore when ac_criteria is CR_ANY_FREE. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | |Memory: 512GB |-------------------------| |960GB SSD (0.5GB/s)| base | patched | |-------------------|-------|-----------------| |mb_optimize_scan=0 | 2667 | 4821 (+80.7%) | |mb_optimize_scan=1 | 2643 | 4784 (+81.0%) | |CPU: AMD 9654 * 2 | P96 | |Memory: 1536GB |-------------------------| |960GB SSD (1GB/s) | base | patched | |-------------------|-------|-----------------| |mb_optimize_scan=0 | 3450 | 15371 (+345%) | |mb_optimize_scan=1 | 3209 | 6101 (+90.0%) | Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25ext4: initialize superblock fields in the kballoc-test.c kunit testsZhang Yi
Various changes in the "ext4: better scalability for ext4 block allocation" patch series have resulted in kunit test failures, most notably in the test_new_blocks_simple and the test_mb_mark_used tests. The root cause of these failures is that various in-memory ext4 data structures were not getting initialized, and while previous versions of the functions exercised by the unit tests didn't use these structure members, this was arguably a test bug. Since one of the patches in the block allocation scalability patches is a fix which is has a cc:stable tag, this commit also has a cc:stable tag. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25ovl: properly print correct variableAntonio Quartulli
In case of ovl_lookup_temp() failure, we currently print `err` which is actually not initialized at all. Instead, properly print PTR_ERR(whiteout) which is where the actual error really is. Address-Coverity-ID: 1647983 ("Uninitialized variables (UNINIT)") Fixes: 8afa0a7367138 ("ovl: narrow locking in ovl_whiteout()") Signed-off-by: Antonio Quartulli <antonio@mandelbit.com> Link: https://lore.kernel.org/20250721203821.7812-1-antonio@mandelbit.com Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-24ksmbd: fix corrupted mtime and ctime in smb2_openNamjae Jeon
If STATX_BASIC_STATS flags are not given as an argument to vfs_getattr, It can not get ctime and mtime in kstat. This causes a problem showing mtime and ctime outdated from cifs.ko. File: /xfstest.test/foo Size: 4096 Blocks: 8 IO Block: 1048576 regular file Device: 0,65 Inode: 2033391 Links: 1 Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Context: system_u:object_r:cifs_t:s0 Access: 2025-07-23 22:15:30.136051900 +0100 Modify: 1970-01-01 01:00:00.000000000 +0100 Change: 1970-01-01 01:00:00.000000000 +0100 Birth: 2025-07-23 22:15:30.136051900 +0100 Cc: stable@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-24ksmbd: fix Preauh_HashValue race conditionNamjae Jeon
If client send multiple session setup requests to ksmbd, Preauh_HashValue race condition could happen. There is no need to free sess->Preauh_HashValue at session setup phase. It can be freed together with session at connection termination phase. Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27661 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-24ksmbd: check return value of xa_store() in krb5_authenticateNamjae Jeon
xa_store() may fail so check its return value and return error code if error occurred. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-24ksmbd: fix null pointer dereference error in generate_encryptionkeyNamjae Jeon
If client send two session setups with krb5 authenticate to ksmbd, null pointer dereference error in generate_encryptionkey could happen. sess->Preauth_HashValue is set to NULL if session is valid. So this patch skip generate encryption key if session is valid. Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27654 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-24bcachefs: Add missing snapshots_seen_add_inorder()Kent Overstreet
This fixes an infinite loop when repairing "extent past end of inode", when the extent is an older snapshot than the inode that needs repair. Without the snaphsots_seen_add_inorder() we keep trying to delete the same extent, even though it's no longer visible in the inode's snapshot. Fixes: 63d6e9311999 ("bcachefs: bch2_fpunch_snapshot()") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-24bcachefs: Fix write buffer flushing from open journal entryKent Overstreet
When flushing the btree write buffer, we pull write buffer keys directly from the journal instead of letting the journal write path copy them to the write buffer. When flushing from the currently open journal buffer, we have to block new reservations and wait for outstanding reservations to complete. Recheck the reservation state after blocking new reservations: previously, we were checking the reservation count from before calling __journal_block(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-24Merge tag 'mm-hotfixes-stable-2025-07-24-18-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 7 are for MM" * tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: sprintf.h requires stdarg.h resource: fix false warning in __request_region() mm/damon/core: commit damos_quota_goal->nid kasan: use vmalloc_dump_obj() for vmalloc error reports mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show() mm: update MAINTAINERS entry for HMM nilfs2: reject invalid file types when reading inodes selftests/mm: fix split_huge_page_test for folio_split() tests mailmap: add entry for Senozhatsky mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
2025-07-24fs/Kconfig: enable HUGETLBFS only if ARCH_SUPPORTS_HUGETLBFSAnshuman Khandual
Enable HUGETLBFS only when platform subscrbes via ARCH_SUPPORTS_HUGETLBFS. Hence select ARCH_SUPPORTS_HUGETLBFS on existing x86 and sparc for their continuing HUGETLBFS support. While here also just drop existing 'BROKEN' dependency. Link: https://lkml.kernel.org/r/20250711102934.2399533-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24fs/proc/task_mmu: read proc/pid/maps under per-vma lockSuren Baghdasaryan
With maple_tree supporting vma tree traversal under RCU and per-vma locks, /proc/pid/maps can be read while holding individual vma locks instead of locking the entire address space. A completely lockless approach (walking vma tree under RCU) would be quite complex with the main issue being get_vma_name() using callbacks which might not work correctly with a stable vma copy, requiring original (unstable) vma - see special_mapping_name() for example. When per-vma lock acquisition fails, we take the mmap_lock for reading, lock the vma, release the mmap_lock and continue. This fallback to mmap read lock guarantees the reader to make forward progress even during lock contention. This will interfere with the writer but for a very short time while we are acquiring the per-vma lock and only when there was contention on the vma reader is interested in. We shouldn't see a repeated fallback to mmap read locks in practice, as this require a very unlikely series of lock contentions (for instance due to repeated vma split operations). However even if this did somehow happen, we would still progress. One case requiring special handling is when a vma changes between the time it was found and the time it got locked. A problematic case would be if a vma got shrunk so that its vm_start moved higher in the address space and a new vma was installed at the beginning: reader found: |--------VMA A--------| VMA is modified: |-VMA B-|----VMA A----| reader locks modified VMA A reader reports VMA A: | gap |----VMA A----| This would result in reporting a gap in the address space that does not exist. To prevent this we retry the lookup after locking the vma, however we do that only when we identify a gap and detect that the address space was changed after we found the vma. This change is designed to reduce mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace visible disadvantage: it allows for sub-page data tearing as opposed to the previous mechanism where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume is will be able to handle sub-page data tearing as well. Link: https://lkml.kernel.org/r/20250719182854.3166724-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24fs/proc/task_mmu: remove conversion of seq_file position to unsignedSuren Baghdasaryan
Back in 2.6 era, last_addr used to be stored in seq_file->version variable, which was unsigned long. As a result, sentinels to represent gate vma and end of all vmas used unsigned values. In more recent kernels we don't used seq_file->version anymore and therefore conversion from loff_t into unsigned type is not needed. Similarly, sentinel values don't need to be unsigned. Remove type conversion for set_file position and change sentinel values to signed. While at it, change the hardcoded sentinel values with named definitions for better documentation. Link: https://lkml.kernel.org/r/20250719182854.3166724-6-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jann Horn <jannh@google.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24fs: stable_page_flags(): use snapshot_page()Luiz Capitulino
A race condition is possible in stable_page_flags() where user-space is reading /proc/kpageflags concurrently to a folio split. This may lead to oopses or BUG_ON()s being triggered. To fix this, this commit uses snapshot_page() in stable_page_flags() so that stable_page_flags() works with a stable page and folio snapshots instead. Note that stable_page_flags() makes use of some functions that require the original page or folio pointer to work properly (eg. is_free_budy_page() and folio_test_idle()). Since those functions can't be used on the page snapshot, we replace their usage with flags that were set by snapshot_page() for this purpose. Link: https://lkml.kernel.org/r/52c16c0f00995a812a55980c2f26848a999a34ab.1752499009.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24proc: kpagecount: use snapshot_page()Luiz Capitulino
Currently, the call to folio_precise_page_mapcount() from kpage_read() can race with a folio split. When the race happens we trigger a VM_BUG_ON_FOLIO() in folio_entire_mapcount() (see splat below). This commit fixes this race by using snapshot_page() so that we retrieve the folio mapcount using a folio snapshot. [ 2356.558576] page: refcount:1 mapcount:1 mapping:0000000000000000 index:0xffff85200 pfn:0x6f7c00 [ 2356.558748] memcg:ffff000651775780 [ 2356.558763] anon flags: 0xafffff60020838(uptodate|dirty|lru|owner_2|swapbacked|node=1|zone=2|lastcpupid=0xfffff) [ 2356.558796] raw: 00afffff60020838 fffffdffdb5d0048 fffffdffdadf7fc8 ffff00064c1629c1 [ 2356.558817] raw: 0000000ffff85200 0000000000000000 0000000100000000 ffff000651775780 [ 2356.558839] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio)) [ 2356.558882] ------------[ cut here ]------------ [ 2356.558897] kernel BUG at ./include/linux/mm.h:1103! [ 2356.558982] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 2356.564729] CPU: 8 UID: 0 PID: 1864 Comm: folio-split-rac Tainted: G S W 6.15.0+ #3 PREEMPT(voluntary) [ 2356.566196] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN [ 2356.566814] Hardware name: Red Hat KVM, BIOS edk2-20241117-3.el9 11/17/2024 [ 2356.567684] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2356.568563] pc : kpage_read.constprop.0+0x26c/0x290 [ 2356.569605] lr : kpage_read.constprop.0+0x26c/0x290 [ 2356.569992] sp : ffff80008fb739b0 [ 2356.570263] x29: ffff80008fb739b0 x28: ffff00064aa69580 x27: 00000000ff000000 [ 2356.570842] x26: 0000fffffffffff8 x25: ffff00064aa69580 x24: ffff80008fb73ae0 [ 2356.571411] x23: 0000000000000001 x22: 0000ffff86c6e8b8 x21: 0000000000000008 [ 2356.571978] x20: 00000000006f7c00 x19: 0000ffff86c6e8b8 x18: 0000000000000000 [ 2356.572581] x17: 3630303066666666 x16: 0000000000000003 x15: 0000000000001000 [ 2356.573217] x14: 00000000ffffffff x13: 0000000000000004 x12: 00aaaaaa00aaaaaa [ 2356.577674] x11: 0000000000000000 x10: 00aaaaaa00aaaaaa x9 : ffffbf3afca6c300 [ 2356.578332] x8 : 0000000000000002 x7 : 0000000000000001 x6 : 0000000000000001 [ 2356.578984] x5 : ffff000c79812408 x4 : 0000000000000000 x3 : 0000000000000000 [ 2356.579635] x2 : 0000000000000000 x1 : ffff00064aa69580 x0 : 000000000000003e [ 2356.580286] Call trace: [ 2356.580524] kpage_read.constprop.0+0x26c/0x290 (P) [ 2356.580982] kpagecount_read+0x28/0x40 [ 2356.581336] proc_reg_read+0x38/0x100 [ 2356.581681] vfs_read+0xcc/0x320 [ 2356.581992] ksys_read+0x74/0x118 [ 2356.582306] __arm64_sys_read+0x24/0x38 [ 2356.582668] invoke_syscall+0x70/0x100 [ 2356.583022] el0_svc_common.constprop.0+0x48/0xf8 [ 2356.583456] do_el0_svc+0x28/0x40 [ 2356.583930] el0_svc+0x38/0x118 [ 2356.584328] el0t_64_sync_handler+0x144/0x168 [ 2356.584883] el0t_64_sync+0x19c/0x1a0 [ 2356.585350] Code: aa0103e0 9003a541 91082021 97f813fc (d4210000) [ 2356.586130] ---[ end trace 0000000000000000 ]--- [ 2356.587377] note: folio-split-rac[1864] exited with irqs disabled [ 2356.588050] note: folio-split-rac[1864] exited with preempt_count 1 Link: https://lkml.kernel.org/r/1c05cc725b90962d56323ff2e28e9cc3ae397b68.1752499009.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reported-by: syzbot+3d7dc5eaba6b932f8535@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67812fbd.050a0220.d0267.0030.GAE@google.com/Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/mremap: use an explicit uffd failure path for mremapLorenzo Stoakes
Right now it appears that the code is relying upon the returned destination address having bits outside PAGE_MASK to indicate whether an error value is specified, and decrementing the increased refcount on the uffd ctx if so. This is not a safe means of determining an error value, so instead, be specific. It makes far more sense to do so in a dedicated error path, so add mremap_userfaultfd_fail() for this purpose and use this when an error arises. A vm_userfaultfd_ctx is not established until we are at the point where mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a no-op until this happens. That is - uffd remap notification only occurs if the VMA is actually moved - at which point a UFFD_EVENT_REMAP event is raised. No errors can occur after this point currently, though it's certainly not guaranteed this will always remain the case, and we mustn't rely on this. However, the reason for needing to handle this case is that, when an error arises on a VMA move at the point of adjusting page tables, we revert this operation, and propagate the error. At this point, it is not correct to raise a uffd remap event, and we must handle it. This refactoring makes it abundantly clear what we are doing. We assume vrm->new_addr is always valid, which a prior change made the case even for mremap() invocations which don't move the VMA, however given no uffd context would be set up in this case it's immaterial to this change anyway. No functional change intended. Link: https://lkml.kernel.org/r/a70e8a1f7bce9f43d1431065b414e0f212297297.1752770784.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24ubifs: stop using write_cache_pagesChristoph Hellwig
Stop using the obsolete write_cache_pages and use writeback_iter directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2025-07-24f2fs: zone: wait for inflight dio completion, excluding pinned files read ↵yohan.joung
using dio read for the pinfile using Direct I/O do not wait for dio write. Signed-off-by: yohan.joung <yohan.joung@sk.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-24f2fs: ignore valid ratio when free section count is lowDaeho Jeong
Otherwise F2FS will not do GC in background in low free section. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-24f2fs: don't break allocation when crossing contiguous sectionsChao Yu
Commit 0638a3197c19 ("f2fs: avoid unused block when dio write in LFS mode") has fixed unused block issue for dio write in lfs mode. However, f2fs_map_blocks() may break and return smaller extent when last allocated block locates in the end of section, even allocator can allocate contiguous blocks across sections. Actually, for the case that allocator returns a block address which is not contiguous w/ current extent, we can record the block address in iomap->private, in the next round, skip reallocating for the last allocated block, then we can fix unused block issue, meanwhile, also, we can allocates contiguous physical blocks as much as possible for dio write in lfs mode. Testcase: - mkfs.f2fs -f /dev/vdb - mount -o mode=lfs /dev/vdb /mnt/f2fs - dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=3; sync; - dd if=/dev/zero of=/mnt/f2fs/dio bs=2M count=1 oflag=direct; - umount /mnt/f2fs Before: f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 0, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 256, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 512, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 5, file offset = 0, start blkaddr = 0x4700, len = 0x100, flags = 3, seg_type = 1, may_create = 1, multidevice = 0, flag = 3, err = 0 f2fs_map_blocks: dev = (253,16), ino = 5, file offset = 256, start blkaddr = 0x4800, len = 0x100, flags = 3, seg_type = 1, may_create = 1, multidevice = 0, flag = 3, err = 0 After: f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 0, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 256, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 512, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 5, file offset = 0, start blkaddr = 0x4700, len = 0x200, flags = 3, seg_type = 1, may_create = 1, multidevice = 0, flag = 3, err = 0 Cc: Daejun Park <daejun7.park@samsung.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-24f2fs: remove unnecessary tracepoint enabled checkSheng Yong
There is no extra work before trace_f2fs_[dataread|datawrite]_end(), so there is no need to check trace_<tracepoint>_enabled(). Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-24f2fs: merge the two conditions to avoid code duplicationmason.zhang
No functional changes. Signed-off-by: mason.zhang <masonzhang.linuxer@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-24f2fs: vm_unmap_ram() may be called from an invalid contextJan Prusakowski
When testing F2FS with xfstests using UFS backed virtual disks the kernel complains sometimes that f2fs_release_decomp_mem() calls vm_unmap_ram() from an invalid context. Example trace from f2fs/007 test: f2fs/007 5s ... [12:59:38][ 8.902525] run fstests f2fs/007 [ 11.468026] BUG: sleeping function called from invalid context at mm/vmalloc.c:2978 [ 11.471849] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 68, name: irq/22-ufshcd [ 11.475357] preempt_count: 1, expected: 0 [ 11.476970] RCU nest depth: 0, expected: 0 [ 11.478531] CPU: 0 UID: 0 PID: 68 Comm: irq/22-ufshcd Tainted: G W 6.16.0-rc5-xfstests-ufs-g40f92e79b0aa #9 PREEMPT(none) [ 11.478535] Tainted: [W]=WARN [ 11.478536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 11.478537] Call Trace: [ 11.478543] <TASK> [ 11.478545] dump_stack_lvl+0x4e/0x70 [ 11.478554] __might_resched.cold+0xaf/0xbe [ 11.478557] vm_unmap_ram+0x21/0xb0 [ 11.478560] f2fs_release_decomp_mem+0x59/0x80 [ 11.478563] f2fs_free_dic+0x18/0x1a0 [ 11.478565] f2fs_finish_read_bio+0xd7/0x290 [ 11.478570] blk_update_request+0xec/0x3b0 [ 11.478574] ? sbitmap_queue_clear+0x3b/0x60 [ 11.478576] scsi_end_request+0x27/0x1a0 [ 11.478582] scsi_io_completion+0x40/0x300 [ 11.478583] ufshcd_mcq_poll_cqe_lock+0xa3/0xe0 [ 11.478588] ufshcd_sl_intr+0x194/0x1f0 [ 11.478592] ufshcd_threaded_intr+0x68/0xb0 [ 11.478594] ? __pfx_irq_thread_fn+0x10/0x10 [ 11.478599] irq_thread_fn+0x20/0x60 [ 11.478602] ? __pfx_irq_thread_fn+0x10/0x10 [ 11.478603] irq_thread+0xb9/0x180 [ 11.478605] ? __pfx_irq_thread_dtor+0x10/0x10 [ 11.478607] ? __pfx_irq_thread+0x10/0x10 [ 11.478609] kthread+0x10a/0x230 [ 11.478614] ? __pfx_kthread+0x10/0x10 [ 11.478615] ret_from_fork+0x7e/0xd0 [ 11.478619] ? __pfx_kthread+0x10/0x10 [ 11.478621] ret_from_fork_asm+0x1a/0x30 [ 11.478623] </TASK> This patch modifies in_task() check inside f2fs_read_end_io() to also check if interrupts are disabled. This ensures that pages are unmapped asynchronously in an interrupt handler. Fixes: bff139b49d9f ("f2fs: handle decompress only post processing in softirq") Signed-off-by: Jan Prusakowski <jprusakowski@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc8). Conflicts: drivers/net/ethernet/microsoft/mana/gdma_main.c 9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion") 755391121038 ("net: mana: Allocate MSI-X vectors dynamically") https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au Adjacent changes: drivers/net/ethernet/ti/icssg/icssg_prueth.h 6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG") ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24smb/server: add ksmbd_vfs_kern_path()NeilBrown
The function ksmbd_vfs_kern_path_locked() seems to serve two functions and as a result has an odd interface. On success it returns with the parent directory locked and with write access on that filesystem requested, but it may have crossed over a mount point to return the path, which makes the lock and the write access irrelevant. This patches separates the functionality into two functions: - ksmbd_vfs_kern_path() does not lock the parent, does not request write access, but does cross mount points - ksmbd_vfs_kern_path_locked() does not cross mount points but does lock the parent and request write access. The parent_path parameter is no longer needed. For the _locked case the final path is sufficient to drop write access and to unlock the parent (using path->dentry->d_parent which is safe while the lock is held). There were 3 caller of ksmbd_vfs_kern_path_locked(). - smb2_create_link() needs to remove the target if it existed and needs the lock and the write-access, so it continues to use ksmbd_vfs_kern_path_locked(). It would not make sense to cross mount points in this case. - smb2_open() is the only user that needs to cross mount points and it has no need for the lock or write access, so it now uses ksmbd_vfs_kern_path() - smb2_creat() does not need to cross mountpoints as it is accessing a file that it has just created on *this* filesystem. But also it does not need the lock or write access because by the time ksmbd_vfs_kern_path_locked() was called it has already created the file. So it could use either interface. It is simplest to use ksmbd_vfs_kern_path(). ksmbd_vfs_kern_path_unlock() is still needed after ksmbd_vfs_kern_path_locked() but it doesn't require the parent_path any more. After a successful call to ksmbd_vfs_kern_path(), only path_put() is needed to release the path. Signed-off-by: NeilBrown <neil@brown.name> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-24xfs: don't use a xfs_log_iovec for ri_buf in log recoveryChristoph Hellwig
ri_buf just holds a pointer/len pair and is not a log iovec used for writing to the log. Switch to use a kvec instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: don't use a xfs_log_iovec for attr_item names and valuesChristoph Hellwig
These buffers are not directly logged, just use a kvec and remove the xlog_copy_from_iovec helper only used here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: use better names for size members in xfs_log_vecChristoph Hellwig
The lv_size member counts the size of the entire allocation, rename it to lv_alloc_size to make that clear. The lv_buf_len member tracks how much of lv_buf has been used up to format the log item, rename it to lv_buf_used to make that more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: cleanup the ordered item logic in xlog_cil_insert_format_itemsChristoph Hellwig
Split out handling of ordered items into a single branch in xlog_cil_insert_format_items so that the rest of the code becomes more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: don't pass the old lv to xfs_cil_prepare_itemChristoph Hellwig
By the time xfs_cil_prepare_item is called, the old lv is still pointed to by the log item. Take it from there instead of spreading the old lv logic over xlog_cil_insert_format_items and xfs_cil_prepare_item. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: remove unused trace event xfs_reflink_cow_enospcSteven Rostedt
The call to the event xfs_reflink_cow_enospc was removed when the COW handling was merged into xfs_file_iomap_begin_delay, but the trace event itself was not. Remove it. Fixes: db46e604adf8 ("xfs: merge COW handling into xfs_file_iomap_begin_delay") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: remove unused trace event xfs_discard_rtrelaxSteven Rostedt
The trace event xfs_discard_rtrelax was added but never used. Remove it. Fixes: a330cae8a7147 ("xfs: Remove header files which are included more than once") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: remove unused trace event xfs_log_cil_returnSteven Rostedt
The trace event xfs_log_cil_return was added but never used. Remove it. Fixes: c1220522ef405 ("xfs: grant heads track byte counts, not LSNs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: remove unused trace event xfs_dqreclaim_dirtySteven Rostedt
The tracepoint trace_xfs_dqreclaim_dirty was removed with other code removed from xfs_qm_dquot_isolate() but the defined tracepoint was not. Fixes: d62016b1a2df ("xfs: avoid dquot buffer pin deadlock") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24fs/xfs: replace strncpy with memtostr_pad()Pranav Tyagi
Replace the deprecated strncpy() with memtostr_pad(). This also avoids the need for separate zeroing using memset(). Mark sb_fname buffer with __nonstring as its size is XFSLABEL_MAX and so no terminating NULL for sb_fname. Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: Remove unused label in xfs_dax_notify_dev_failureAlan Huang
Fixes: e967dc40d501 ("xfs: return the allocated transaction from xfs_trans_alloc_empty") Signed-off-by: Alan Huang <mmpgouride@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: improve the comments in xfs_select_zone_nowaitChristoph Hellwig
The top of the function comment is outdated, and the parts still correct duplicate information in comment inside the function. Remove the top of the function comment and instead improve a comment inside the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: improve the comments in xfs_max_open_zonesChristoph Hellwig
Describe the rationale for the decisions a bit better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: stop passing an inode to the zone space reservation helpersChristoph Hellwig
None of them actually needs the inode, the mount is enough. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: rename oz_write_pointer to oz_allocatedChristoph Hellwig
This member just tracks how much space we handed out for sequential write required zones. Only for conventional space it actually is the pointer where thing are written at, otherwise zone append manages that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: use a uint32_t to cache i_used_blocks in xfs_init_zoneChristoph Hellwig
i_used_blocks is a uint32_t, so use the same value for the local variable caching it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: improve the xg_active_ref check in xfs_group_freeChristoph Hellwig
Split up the XFS_IS_CORRUPT statement so that it immediately shows if the reference counter overflowed or underflowed. I ran into this quite a bit when developing the zoned allocator, and had to reapply the patch for some work recently. We might as well just apply it upstream given that freeing group is far removed from performance critical code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: remove the xlog_ticket_t typedefChristoph Hellwig
Almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: remove xrep_trans_{alloc,cancel}_hook_dummyChristoph Hellwig
XFS stopped using current->journal_info in commit f2e812c1522d ("xfs: don't use current->journal_info"), so there is no point in saving and restoring it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: return the allocated transaction from xchk_trans_alloc_emptyChristoph Hellwig
xchk_trans_alloc_empty can't return errors, so return the allocated transaction directly instead of an output double pointer argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: return the allocated transaction from xfs_trans_alloc_emptyChristoph Hellwig
xfs_trans_alloc_empty can't return errors, so return the allocated transaction directly instead of an output double pointer argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: don't use xfs_trans_reserve in xfs_trans_rollChristoph Hellwig
xfs_trans_roll uses xfs_trans_reserve to basically just call into xfs_log_regrant while bypassing the reset of xfs_trans_reserve. Open code the call to xfs_log_regrant in xfs_trans_roll and simplify xfs_trans_reserve now that it never regrants and always asks for a log reservation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: decouple xfs_trans_alloc_empty from xfs_trans_allocChristoph Hellwig
xfs_trans_alloc_empty only shares the very basic transaction structure allocation and initialization with xfs_trans_alloc. Split out a new __xfs_trans_alloc helper for that and otherwise decouple xfs_trans_alloc_empty from xfs_trans_alloc. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: don't use xfs_trans_reserve in xfs_trans_reserve_moreChristoph Hellwig
xfs_trans_reserve_more just tries to allocate additional blocks and/or rtextents and is otherwise unrelated to the transaction reservation logic. Open code the block and rtextent reservation in xfs_trans_reserve_more to prepare for simplifying xfs_trans_reserve. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: use xfs_trans_reserve_more in xfs_trans_reserve_more_inodeChristoph Hellwig
Instead of duplicating the empty transacaction reservation definition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>