path: root/fs
Age | Commit message | Author
2025-03-11 | f2fs: Remove check for ->writepage | Matthew Wilcox (Oracle)
We're almost able to remove a_ops->writepage. This check is unnecessary as we'll never call into __f2fs_write_data_pages() for character devices. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-11 | jfs: add index corruption check to DT_GETPAGE() | Roman Smirnov
If the file system is corrupted, the header.stblindex variable may become greater than 127. Because of this, an array access out of bounds may occur:

------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in fs/jfs/jfs_dtree.c:3096:10
index 237 is out of range for type 'struct dtslot[128]'
CPU: 0 UID: 0 PID: 5822 Comm: syz-executor740 Not tainted 6.13.0-rc4-syzkaller-00110-g4099a71718b0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:231 [inline]
 __ubsan_handle_out_of_bounds+0x121/0x150 lib/ubsan.c:429
 dtReadFirst+0x622/0xc50 fs/jfs/jfs_dtree.c:3096
 dtReadNext fs/jfs/jfs_dtree.c:3147 [inline]
 jfs_readdir+0x9aa/0x3c50 fs/jfs/jfs_dtree.c:2862
 wrap_directory_iterator+0x91/0xd0 fs/readdir.c:65
 iterate_dir+0x571/0x800 fs/readdir.c:108
 __do_sys_getdents64 fs/readdir.c:403 [inline]
 __se_sys_getdents64+0x1e2/0x4b0 fs/readdir.c:389
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>
---[ end trace ]---

Add a stblindex check for corruption.

Reported-by: syzbot <syzbot+9120834fc227768625ba@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=9120834fc227768625ba
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Roman Smirnov <r.smirnov@omp.ru>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
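The fix amounts to validating the on-disk index before it is used to address the 128-entry slot array. A minimal, self-contained C sketch of that kind of bounds check follows; the structure and constant names (dir_page_header, DT_SLOT_COUNT) are illustrative stand-ins, not the actual jfs definitions.

#include <stdio.h>

#define DT_SLOT_COUNT 128   /* mirrors 'struct dtslot[128]' from the UBSAN report */

struct dir_page_header {
	unsigned char stblindex;   /* on-disk value, fully attacker-controlled */
};

/* Reject a corrupted index before it is ever used to address the slot array. */
static int stblindex_valid(const struct dir_page_header *hdr)
{
	return hdr->stblindex < DT_SLOT_COUNT;
}

int main(void)
{
	struct dir_page_header corrupt = { .stblindex = 237 };

	if (!stblindex_valid(&corrupt)) {
		fprintf(stderr, "corrupt dtree page: stblindex %u out of range\n",
			corrupt.stblindex);
		return 1;   /* the real DT_GETPAGE() path would return an I/O error */
	}
	return 0;
}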
2025-03-11 | bcachefs: Make sure trans is unlocked when submitting read IO | Kent Overstreet
We were still using the trans after the unlock, leading to this bug in the retry path:

00255 ------------[ cut here ]------------
00255 kernel BUG at fs/bcachefs/btree_iter.c:3348!
00255 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 86048768: no device to read from:
00255 u64s 8 type extent 4098:168192:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:8040a368 compress none ec: idx 83 block 1 ptr: 0:302:128 gen 0
00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 85983232: no device to read from:
00255 u64s 8 type extent 4098:168064:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:43311336 compress none ec: idx 83 block 1 ptr: 0:302:0 gen 0
00255 Modules linked in:
00255 CPU: 5 UID: 0 PID: 304 Comm: kworker/u70:2 Not tainted 6.14.0-rc6-ktest-g526aae23d67d #16040
00255 Hardware name: linux,dummy-virt (DT)
00255 Workqueue: events_unbound bch2_rbio_retry
00255 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00255 pc : __bch2_trans_get+0x100/0x378
00255 lr : __bch2_trans_get+0xa0/0x378
00255 sp : ffffff80c865b760
00255 x29: ffffff80c865b760 x28: 0000000000000000 x27: ffffff80d76ed880
00255 x26: 0000000000000018 x25: 0000000000000000 x24: ffffff80f4ec3760
00255 x23: ffffff80f4010140 x22: 0000000000000056 x21: ffffff80f4ec0000
00255 x20: ffffff80f4ec3788 x19: ffffff80d75f8000 x18: 00000000ffffffff
00255 x17: 2065707974203820 x16: 7334367520200a3a x15: 0000000000000008
00255 x14: 0000000000000001 x13: 0000000000000100 x12: 0000000000000006
00255 x11: ffffffc080b47a40 x10: 0000000000000000 x9 : ffffffc08038dea8
00255 x8 : ffffff80d75fc018 x7 : 0000000000000000 x6 : 0000000000003788
00255 x5 : 0000000000003760 x4 : ffffff80c922de80 x3 : ffffff80f18f0000
00255 x2 : ffffff80c922de80 x1 : 0000000000000130 x0 : 0000000000000006
00255 Call trace:
00255  __bch2_trans_get+0x100/0x378 (P)
00255  bch2_read_io_err+0x98/0x260
00255  bch2_read_endio+0xb8/0x2d0
00255  __bch2_read_extent+0xce8/0xfe0
00255  __bch2_read+0x2a8/0x978
00255  bch2_rbio_retry+0x188/0x318
00255  process_one_work+0x154/0x390
00255  worker_thread+0x20c/0x3b8
00255  kthread+0xf0/0x1b0
00255  ret_from_fork+0x10/0x20
00255 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000)
00255 ---[ end trace 0000000000000000 ]---
00255 Kernel panic - not syncing: Oops - BUG: Fatal exception
00255 SMP: stopping secondary CPUs
00255 Kernel Offset: disabled
00255 CPU features: 0x000,00000070,00000010,8240500b
00255 Memory Limit: none
00255 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 | bcachefs: Initialize from_inode members for bch_io_opts | Roxana Nicolescu
When there is no inode source, all "from_inode" members in the structure bch_io_opts should be set to false. Fixes: 7a7c43a0c1ecf ("bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode") Reported-by: syzbot+c17ad4b4367b72a853cb@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c17ad4b4367b72a853cb Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 | bcachefs: Fix b->written overflow | Alan Huang
When a bset is past the end of the btree node, we should not add its sectors to b->written, which would overflow b->written. Reported-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com Tested-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
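A rough sketch of the guard described above: only account a bset's sectors when it lies entirely inside the node, so the narrow written counter cannot wrap. The field widths and names (node_acct, node_sectors) are assumptions for illustration, not the real bcachefs definitions.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the btree-node bookkeeping. */
struct node_acct {
	uint16_t written;        /* sectors validated so far; narrow, so it can wrap */
	unsigned node_sectors;   /* total size of the btree node in sectors */
};

/* Skip accounting for a bset that claims to extend past the end of the node. */
static int account_bset(struct node_acct *n, unsigned offset, unsigned sectors)
{
	if (offset + sectors > n->node_sectors)
		return -1;   /* corrupt bset: do not touch n->written */

	n->written = offset + sectors;
	return 0;
}

int main(void)
{
	struct node_acct n = { .written = 0, .node_sectors = 512 };
	printf("%d\n", account_bset(&n, 480, 128));   /* rejected: would pass the end */
	return 0;
}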
2025-03-11 | vboxsf: Add __nonstring annotations for unterminated strings | Kees Cook
When a character array without a terminating NUL character has a static initializer, GCC 15's -Wunterminated-string-initialization will only warn if the array lacks the "nonstring" attribute[1]. Mark the arrays with __nonstring to correctly identify the char arrays as "not a C string" and thereby eliminate the warning. This effectively reverts the change in 4e7487245abc ("vboxsf: fix building with GCC 15"), instead adding the annotation, which has other uses (i.e. warning if the string is ever used with C string APIs). Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117178 [1] Cc: Hans de Goede <hdegoede@redhat.com> Cc: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250310222530.work.374-kees@kernel.org Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
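For reference, here is the shape of such an annotation as a standalone, compilable example; the array name and contents are invented for illustration and are not the actual vboxsf fields.

#include <stdio.h>

/* Fall back to a no-op when the compiler lacks the attribute. */
#if defined(__has_attribute)
# if __has_attribute(nonstring)
#  define __nonstring __attribute__((nonstring))
# endif
#endif
#ifndef __nonstring
# define __nonstring
#endif

/*
 * Exactly fills the 8-byte array with no terminating NUL.  __nonstring tells
 * GCC 15's -Wunterminated-string-initialization that this is intentional,
 * while still warning if the array is ever passed to a C-string API.
 */
static const char signature[8] __nonstring = "VBSF_SIG";

int main(void)
{
	/* Print byte-by-byte; the array is deliberately not NUL-terminated. */
	for (unsigned i = 0; i < sizeof(signature); i++)
		putchar(signature[i]);
	putchar('\n');
	return 0;
}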
2025-03-11 | xfs: trigger zone GC when out of available rt blocks | Hans Holmberg
We periodically check the available rt blocks when filling up zones and start GC if needed, but we may run completely out in between filling zones, so start GC (unless it is already running) if we can't reserve writable space. This should only happen as a corner case in setups with very few backing zones. Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-11 | Revert "f2fs: rebuild nat_bits during umount" | Chao Yu
This reverts commit 94c821fb286b545d37549ff30a0c341e066f0d6c. It was reported that there is potential corruption in the node footer; the most suspicious feature is nat_bits, so let's revert the recovery-related code. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-11 | f2fs: fix to avoid accessing uninitialized curseg | Chao Yu
syzbot reports an f2fs bug as below:

F2FS-fs (loop3): Stopped filesystem due to reason: 7
kworker/u8:7: attempt to access beyond end of device
BUG: unable to handle page fault for address: ffffed1604ea3dfa
RIP: 0010:get_ckpt_valid_blocks fs/f2fs/segment.h:361 [inline]
RIP: 0010:has_curseg_enough_space fs/f2fs/segment.h:570 [inline]
RIP: 0010:__get_secs_required fs/f2fs/segment.h:620 [inline]
RIP: 0010:has_not_enough_free_secs fs/f2fs/segment.h:633 [inline]
RIP: 0010:has_enough_free_secs+0x575/0x1660 fs/f2fs/segment.h:649
 <TASK>
 f2fs_is_checkpoint_ready fs/f2fs/segment.h:671 [inline]
 f2fs_write_inode+0x425/0x540 fs/f2fs/inode.c:791
 write_inode fs/fs-writeback.c:1525 [inline]
 __writeback_single_inode+0x708/0x10d0 fs/fs-writeback.c:1745
 writeback_sb_inodes+0x820/0x1360 fs/fs-writeback.c:1976
 wb_writeback+0x413/0xb80 fs/fs-writeback.c:2156
 wb_do_writeback fs/fs-writeback.c:2303 [inline]
 wb_workfn+0x410/0x1080 fs/fs-writeback.c:2343
 process_one_work kernel/workqueue.c:3236 [inline]
 process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
 worker_thread+0x870/0xd30 kernel/workqueue.c:3398
 kthread+0x7a9/0x920 kernel/kthread.c:464
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

Commit 8b10d3653735 ("f2fs: introduce FAULT_NO_SEGMENT") allows triggering a no-free-segment fault in the allocator, which then updates curseg->segno to NULL_SEGNO. Although CP_ERROR_FLAG has been set, f2fs_write_inode() misses checking the flag and accesses the invalid curseg->segno directly in the call path below, resulting in a panic:

- f2fs_write_inode
 - f2fs_is_checkpoint_ready
  - has_enough_free_secs
   - has_not_enough_free_secs
    - __get_secs_required
     - has_curseg_enough_space
      - get_ckpt_valid_blocks : access invalid curseg->segno

To avoid this issue, let's:
- check the CP_ERROR_FLAG flag prior to f2fs_is_checkpoint_ready() in f2fs_write_inode().
- in has_curseg_enough_space(), save curseg->segno into a temporary variable and verify its validity before use.

Fixes: 8b10d3653735 ("f2fs: introduce FAULT_NO_SEGMENT")
Reported-by: syzbot+b6b347b7a4ea1b2e29b6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67973c2b.050a0220.11b1bb.0089.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
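The second part of the fix follows a common defensive pattern: read the field once into a local, validate it against the sentinel, and only then use it. A self-contained sketch of that pattern, where the NULL_SEGNO value and structure layout are assumptions for illustration rather than the real f2fs definitions:

#include <stdio.h>

#define NULL_SEGNO ((unsigned int)~0)   /* sentinel: no current segment allocated */

struct curseg_info {
	unsigned int segno;
};

/*
 * Snapshot segno into a local and validate it before it is used as an index,
 * instead of dereferencing curseg->segno repeatedly as the crashing path did.
 */
static int ckpt_valid_blocks(const struct curseg_info *curseg, unsigned int nr_segments)
{
	unsigned int segno = curseg->segno;

	if (segno == NULL_SEGNO || segno >= nr_segments)
		return -1;   /* report "not enough space" instead of faulting */

	return 0;   /* safe to index per-segment checkpoint data with segno here */
}

int main(void)
{
	struct curseg_info broken = { .segno = NULL_SEGNO };
	printf("%d\n", ckpt_valid_blocks(&broken, 1024));
	return 0;
}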
2025-03-11 | f2fs: introduce FAULT_INCONSISTENT_FOOTER | Chao Yu
To simulate an inconsistent node footer error. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-11 | f2fs: do sanity check on xattr node footer in f2fs_get_xnode_page() | Chao Yu
This patch introduces a new wrapper, f2fs_get_xnode_page(); callers can use it to load an xattr block into the page cache, and it will also do a sanity check on the xattr node footer. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-11 | f2fs: do sanity check on inode footer in f2fs_get_inode_page() | Chao Yu
This patch introduces a new wrapper, f2fs_get_inode_page(); callers can use it to load an inode block into the page cache, and it will also do a sanity check on the inode footer. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-10 | ksmbd: prevent connection release during oplock break notification | Namjae Jeon
ksmbd_work could be freed after connection release. Increment r_count of ksmbd_conn to indicate that requests are not finished yet and to not release the connection. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-10 | ksmbd: fix use-after-free in ksmbd_free_work_struct | Namjae Jeon
->interim_entry of ksmbd_work could be deleted after the oplock is freed. We don't need to manage it with a linked list; the interim request can be sent immediately whenever an oplock break wait is needed. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-10 | gfs2: Convert gfs2_meta_read_endio() to use a folio | Matthew Wilcox (Oracle)
Switch from bio_for_each_segment_all() to bio_for_each_folio_all() which removes a call to page_buffers(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Convert gfs2_end_log_write_bh() to work on a folio | Matthew Wilcox (Oracle)
gfs2_end_log_write() has to handle bios which consist of both pages which belong to folios and pages which were allocated from a mempool and do not belong to a folio. It would be cleaner to have separate endio handlers which handle each type, but it's not clear to me whether that's even possible. This patch is slightly forward-looking in that page_folio() cannot currently return NULL, but it will return NULL in the future for pages which do not belong to a folio. This was the last user of page_has_buffers(), so remove it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Convert gfs2_find_jhead() to use a folio | Matthew Wilcox (Oracle)
Remove a call to grab_cache_page() by using a folio throughout this function. [agruenba@redhat.com: Adjust to return value difference between bio_add_page() and bio_add_folio().] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Convert gfs2_jhead_pg_srch() to gfs2_jhead_folio_search() | Matthew Wilcox (Oracle)
Pass in the folio instead of the page. Add an assert that this is not a large folio as we'd need a more complex solution if we wanted to kmap() each page out of a large folio. Removes a use of folio->page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [agruenba@redhat.com: Rename gfs2_jhead_folio_srch() to gfs2_jhead_folio_search().] Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Use b_folio in gfs2_check_magic() | Matthew Wilcox (Oracle)
We are preparing to remove bh->b_page. Use kmap_local_folio() instead of kmap_local_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Use b_folio in gfs2_submit_bhs() | Matthew Wilcox (Oracle)
Remove a reference to bh->b_page which is going to be removed soon. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Use b_folio in gfs2_trans_add_meta() | Matthew Wilcox (Oracle)
The lock bit is maintained on the folio, not on the page. Saves two calls to compound_head() as well as removing two references to bh->b_page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Use b_folio in gfs2_log_write_bh() | Matthew Wilcox (Oracle)
We are preparing to remove bh->b_page. gfs2_log_write() should continue to operate on pages as some of the memory being logged does not come from folios, so convert from folio to page in this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: skip if we cannot defer delete | Andreas Gruenbacher
In gfs2_evict_inode(), in the unlikely case that we cannot defer deleting the inode, it is not safe to fall back to deleting the inode; the only valid choice we have is to skip the delete. In addition, in evict_should_delete(), if we cannot lock the inode glock exclusively, we are in a bad enough state that skipping the delete is likely a better choice than trying to recover from the failure later. Fixes: c5b7a2400edc ("gfs2: Only defer deletes when we have an iopen glock") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: remove redundant warnings | Andreas Gruenbacher
In glock_set_object() and glock_clear_object(), there is no need to print the glock type and number when we dump the entire glock, anyway. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: minor evict fix | Andreas Gruenbacher
In evict_should_delete(), when gfs2_upgrade_iopen_glock() fails, we detach the iopen glock from the inode without calling glock_clear_object(). This leads to a warning in glock_set_object() when the same inode is recreated and the glock is reused. Fix that by only detaching the iopen glock in gfs2_evict_inode(). In addition, remove the dequeue code from evict_should_delete(); we already perform a conditional dequeue in gfs2_evict_inode(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Prevent inode creation race (2) | Andreas Gruenbacher
In gfs2_try_evict(), we try grabbing the inode to evict, we try to evict it, and then we try grabbing it again to see if it still exists. There is no guarantee that we will end up with the same inode both times; the inode validity check that commit ffd1cf0443a2 ("gfs2: Prevent inode creation race") added to the first grab is actually needed both times. (To avoid code duplication, add a grab_existing_inode() helper.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Fix additional unlikely request cancelation race | Andreas Gruenbacher
In gfs2_glock_dq(), we must drop the glock spin lock before calling ->lm_cancel, but this means that in the meantime, the operation we are trying to cancel could complete. If the operation completes unsuccessfully, another holder can end up at the head of the queue and another ->lm_lock operation can get started. In this case, we would end up canceling that second operation by accident. To prevent that, introduce a new GLF_CANCELING flag. Set that flag in gfs2_glock_dq() when trying to cancel an operation. When seeing that flag, finish_xmote() will then keep the GLF_LOCK flag set to prevent other glock operations from taking place. gfs2_glock_dq() then completes the cancelation attempt by clearing GLF_LOCK and GLF_CANCELING. In addition, add a missing GLF_DEMOTE_IN_PROGRESS check in gfs2_glock_dq() to make sure that we won't accidentally cancel a demote request. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Fix request cancelation bug | Andreas Gruenbacher
In finish_xmote(), when a locking request is canceled, the corresponding holder is moved to the tail of the holders list instead of being dequeued immediately. When there is only a single holder, the canceled locking request is then immediately repeated. This makes no sense; it looks like another remnant of LM_FLAG_PRIORITY support. Instead, dequeue canceled holders and proceed with the next holder in finish_xmote(). We can then easily detect in gfs2_glock_dq() when a holder has been canceled. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Check for empty queue in run_queue | Andreas Gruenbacher
In run_queue(), check if the queue of pending requests is empty instead of blindly assuming that it won't be. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Remove more dead code in add_to_queue | Andreas Gruenbacher
Remove some more dead code in add_to_queue() that commit 0b93bac2271e ("gfs2: Remove LM_FLAG_PRIORITY flag") has rendered obsolete. This is a continuation of commit 3302764610057 ("gfs2: remove dead code in add_to_queue"); no functional change. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Replace GIF_DEFER_DELETE with GLF_DEFER_DELETE | Andreas Gruenbacher
Having this flag attached to the iopen glock instead of the inode is much simpler; it eliminates a potential weird race in gfs2_try_evict(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: glock holder GL_NOPID fix | Andreas Gruenbacher
Glocks are always actively acquired by processes, but as indicated by the GL_NOPID holder flag, some of them are then associated with objects like cached inodes rather than the process that acquired them. As such, for those glock holders, it makes little sense to dump which processes originally acquired them. Therefore, gfs2 is trying to hide the identity of the processes that acquired those glocks. The code for doing that is incorrect though, so fix it. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Add GLF_PENDING_REPLY flag | Andreas Gruenbacher
Introduce a new GLF_PENDING_REPLY flag to indicate that a reply from DLM is expected. Include that flag in glock dumps to show more clearly what's going on. (When the GLF_PENDING_REPLY flag is set, the GLF_LOCK flag will also be set but the GLF_LOCK flag alone isn't sufficient to tell that we are waiting for a DLM reply.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | gfs2: Decode missing glock flags in tracepoints | Andreas Gruenbacher
A number of glock flags are currently not shown in the text form of glock tracepoints; add them. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-03-10 | Merge 6.14-rc6 into driver-core-next | Greg Kroah-Hartman
We need the driver core fix in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-10 | xfs: trace what memory backs a buffer | Christoph Hellwig
Add three trace points for the different backing memory allocators for buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: cleanup mapping tmpfs folios into the buffer cache | Christoph Hellwig
Directly assign b_addr based on the tmpfs folios without a detour through pages, reuse the folio_put path used for non-tmpfs buffers and replace all references to pages in comments with folios. Partially based on a patch from Dave Chinner <dchinner@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: use vmalloc instead of vm_map_area for buffer backing memory | Christoph Hellwig
The fallback buffer allocation path currently open codes a suboptimal version of vmalloc to allocate pages that are then mapped into vmalloc space. Switch to using vmalloc instead, which uses all the optimizations in the common vmalloc code, and removes the need to track the backing pages in the xfs_buf structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: buffer items don't straddle pages anymore | Dave Chinner
Unmapped buffers don't exist anymore, so the page straddling detection and slow path code can go away now. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: kill XBF_UNMAPPED | Christoph Hellwig
Unmapped buffer access is a pain, so kill it. The switch to large folios means we rarely pay a vmap penalty for large buffers, so this functionality is largely unnecessary now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: convert buffer cache to use high order folios | Christoph Hellwig
Now that we have the buffer cache using the folio API, we can extend the use of folios to allocate high order folios for multi-page buffers rather than an array of single pages that are then vmapped into a contiguous range. This creates a new type of single folio buffers that can have arbitrary order, in addition to the existing multi-folio buffers made up of many single page folios that get vmapped. The single folio is for now stashed into the existing b_pages array, but that will go away entirely later in the series and remove the temporary page vs folio typing issues that only work because the two structures currently can be used largely interchangeably.

The code that allocates buffers will optimistically attempt a high order folio allocation as a fast path if the buffer size is a power of two and thus fits into a folio. If this high order allocation fails, then we fall back to the existing multi-folio allocation code. This now forms the slow allocation path, and hopefully will be largely unused in normal conditions except for buffers with sizes that are not a power of two, like larger remote xattrs.

This should improve performance of large buffer operations (e.g. large directory block sizes) as we should now mostly avoid the expense of vmapping large buffers (and the vmap lock contention that can occur) as well as avoid the runtime pressure that frequently accessing kernel vmapped pages puts on the TLBs. Based on a patch from Dave Chinner <dchinner@redhat.com>, but mutilated beyond recognition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: remove the kmalloc to page allocator fallback | Christoph Hellwig
Since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), kmalloc and friends guarantee that power of two sized allocations are naturally aligned. Limit our use of kmalloc for buffers to these power of two sizes and remove the fallback to the page allocator for this case, but keep a check in addition to trusting the slab allocator to get the alignment right. Also refactor the kmalloc path to reuse various calculations for the size and gfp flags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
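The reasoning relies on the natural-alignment guarantee for power-of-two kmalloc sizes. A userspace sketch of the same decision, with aligned_alloc standing in for kmalloc and a made-up helper name (buf_alloc_kmem):

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static bool is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Only power-of-two sizes use the heap allocator, which then guarantees the
 * buffer is naturally aligned; anything else must take the page/folio path,
 * represented here by returning NULL so the caller falls back.
 */
static void *buf_alloc_kmem(size_t size)
{
	if (!is_power_of_2(size))
		return NULL;

	return aligned_alloc(size, size);   /* stand-in for kmalloc() in this sketch */
}

int main(void)
{
	void *p = buf_alloc_kmem(4096);
	free(p);
	return p ? 0 : 1;
}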
2025-03-10 | xfs: refactor backing memory allocations for buffers | Christoph Hellwig
Lift handling of shmem and slab backed buffers into xfs_buf_alloc_pages and rename the result to xfs_buf_alloc_backing_mem. This shares more code and ensures uncached buffers can also use slab, which slightly reduces the memory usage of growfs on 512 byte sector size file systems, but more importantly means the allocation invariants are the same for cached and uncached buffers. Document these new invariants with a big fat comment mostly stolen from a patch by Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: remove xfs_buf_is_vmapped | Christoph Hellwig
No need to look at the page count if we can simply call is_vmalloc_addr on bp->b_addr. This prepares for eventually removing the b_page_count field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: remove xfs_buf.b_offset | Christoph Hellwig
b_offset is only set for slab backed buffers and always set to offset_in_page(bp->b_addr), which can be done just as easily in the only user of b_offset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | xfs: add a fast path to xfs_buf_zero when b_addr is set | Christoph Hellwig
No need to walk the page list if bp->b_addr is valid. That also means b_offset doesn't need to be taken into account in the unmapped loop as b_offset is only set for kmem backed buffers which are always mapped. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
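A minimal sketch of that kind of fast path; the structure below is a simplified stand-in for xfs_buf, not the real thing.

#include <stddef.h>
#include <string.h>

struct buf_sketch {
	void *b_addr;   /* non-NULL when the buffer has one contiguous mapping */
	/* ... otherwise a list of pages/folios would be walked (omitted) ... */
};

/*
 * Zero directly through the mapping when it exists; only fall back to the
 * per-page walk for unmapped buffers.
 */
static void buf_zero(struct buf_sketch *bp, size_t offset, size_t len)
{
	if (bp->b_addr) {
		memset((char *)bp->b_addr + offset, 0, len);   /* fast path */
		return;
	}
	/* slow path: walk the backing pages (not shown in this sketch) */
}

int main(void)
{
	char backing[64] = { 1, 2, 3 };
	struct buf_sketch bp = { .b_addr = backing };

	buf_zero(&bp, 0, sizeof(backing));
	return backing[0];   /* 0 after zeroing */
}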
2025-03-10 | xfs: unmapped buffer item size straddling mismatch | Dave Chinner
We never log large contiguous regions of unmapped buffers, so this bug is never triggered by the current code. However, the slowpath for formatting buffer straddling regions is broken. That is, the size and shape of the log vector calculated across a straddle does not match how the formatting code formats a straddle. This results in a log vector with an uninitialised iovec and this causes a crash when xlog_write_full() goes to copy the iovec into the journal. Whilst touching this code, don't bother checking mapped or single folio buffers for discontiguous regions because they don't have them. This significantly reduces the overhead of this check when logging large buffers as calling xfs_buf_offset() is not free and it occurs a *lot* in those cases. Fixes: 929f8b0deb83 ("xfs: optimise xfs_buf_item_size/format for contiguous regions") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10 | sysctl: Fixes nsm_local_state bounds | Nicolas Bouchinet
Bound nsm_local_state sysctl writes between SYSCTL_ZERO and SYSCTL_INT_MAX. The proc_handler has thus been updated to proc_dointvec_minmax. Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr> [ cel: updated to handle zero - UINT_MAX instead ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
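A hedged, kernel-style sketch of what such a bounded sysctl entry typically looks like; the field values are illustrative rather than the actual lockd declaration, and per the bracketed note above the final handler may bound to UINT_MAX with proc_douintvec_minmax instead.

#include <linux/sysctl.h>

static unsigned int nsm_local_state;   /* assumption: the backing variable */

static struct ctl_table nsm_sysctl_sketch[] = {
	{
		.procname     = "nsm_local_state",
		.data         = &nsm_local_state,
		.maxlen       = sizeof(nsm_local_state),
		.mode         = 0644,
		.proc_handler = proc_dointvec_minmax,  /* rejects writes outside the bounds */
		.extra1       = SYSCTL_ZERO,           /* lower bound */
		.extra2       = SYSCTL_INT_MAX,        /* upper bound */
	},
};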
2025-03-10 | nfsd: use a long for the count in nfsd4_state_shrinker_count() | Jeff Layton
If there are no courtesy clients then the return value from the atomic_long_read() could overflow an int. Use a long to store the value instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
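The underlying issue is an int truncating a value that only fits in a long. A tiny standalone illustration (assumes an LP64 system where long is 64 bits; the count value is made up):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	/* e.g. a large result returned by atomic_long_read() */
	long count = (long)INT_MAX + 42;

	int truncated = (int)count;   /* old pattern: silently overflows */
	long kept = count;            /* fixed pattern: keep the value in a long */

	printf("as int: %d, as long: %ld\n", truncated, kept);
	return 0;
}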
2025-03-10 | nfsd: remove obsolete comment from nfs4_alloc_stid | Jeff Layton
idr_alloc_cyclic() is what guarantees that now, not this long-gone trick. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>