summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-07-17fuse: use iomap for buffered writesJoanne Koong
Have buffered writes go through iomap. This has two advantages: * granular large folio synchronous reads * granular large folio dirty tracking If for example there is a 1 MB large folio and a write issued at pos 1 to pos 1 MB - 2, only the head and tail pages will need to be read in and marked uptodate instead of the entire folio needing to be read in. Non-relevant trailing pages are also skipped (eg if for a 1 MB large folio a write is issued at pos 1 to 4099, only the first two pages are read in and the ones after that are skipped). iomap also has granular dirty tracking. This is useful in that when it comes to writeback time, only the dirty portions of the large folio will be written instead of having to write out the entire folio. For example if there is a 1 MB large folio and only 2 bytes in it are dirty, only the page for those dirty bytes get written out. Please note that granular writeback is only done once fuse also uses iomap in writeback (separate commit). .release_folio needs to be set to iomap_release_folio so that any allocated iomap ifs structs get freed. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250715202122.2282532-2-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16bcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=nKent Overstreet
maybe_casefold() shouldn't have been nooped, just bch2_casefold(). Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Fix build when CONFIG_UNICODE=nKent Overstreet
94426e4201fb, which added the killswitch for casefolding, accidentally removed some of the ifdefs we need to avoid build errors. It appears we need better build testing for different configurations, it took two weeks for the robots to catch this one. Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Fix reference to invalid bucket in copygcKent Overstreet
Use bch2_dev_bucket_tryget() instead of bch2_dev_tryget() before checking the bucket bitmap. Reported-by: syzbot+3168625f36f4a539237e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Don't build aux search tree when still repairing nodeKent Overstreet
bch2_btree_node_drop_keys_outside_node() will (re)build aux search trees, because it's also called by topology repair. bch2_btree_node_read_done() was calling it before validating individual keys; invalid ones have to be dropped. If we call drop_keys_outside_node() first, then bch2_bset_build_aux_tree() doesn't run because the node already has an aux search tree - which was invalidated by the repair. Reported-by: syzbot+c5e7a66b3b23ae65d44f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Tweak threshold for allocator triggering discardsKent Overstreet
The allocator path has a "if we're really low on free buckets, check if we should issue discards" - tweak this to also trigger discards if more than 1/128th of the device is in need_discard state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Fix triggering of discard by the journal pathKent Overstreet
It becomes possible to do discards after a journal flush, which naturally the journal code is reponsible for. A prior refactoring seems to have broken this - which went unnoticed because the foreground allocator path can also trigger discards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16gfs2: No more self recoveryAndreas Gruenbacher
When a node withdraws and it turns out that it is the only node that has the filesystem mounted, gfs2 currently tries to replay the local journal to bring the filesystem back into a consistent state. Not only is that a very bad idea, it has also never worked because gfs2_recover_func() will refuse to do anything during a withdraw. However, before even getting to this point, gfs2_recover_func() dereferences sdp->sd_jdesc->jd_inode. This was a use-after-free before commit 04133b607a78 ("gfs2: Prevent double iput for journal on error") and is a NULL pointer dereference since then. Simply get rid of self recovery to fix that. Fixes: 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish") Reported-by: Chunjie Zhu <chunjie.zhu@cloud.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-16gfs2: Validate i_depth for exhash directoriesAndrew Price
A fuzzer test introduced corruption that ends up with a depth of 0 in dir_e_read(), causing an undefined shift by 32 at: index = hash >> (32 - dip->i_depth); As calculated in an open-coded way in dir_make_exhash(), the minimum depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time. So we can avoid the undefined behaviour by checking for depth values lower than the minimum in gfs2_dinode_in(). Values greater than the maximum are already being checked for there. Also switch the calculation in dir_make_exhash() to use ilog2() to clarify how the depth is calculated. Tested with the syzkaller repro.c and xfstests '-g quick'. Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-16ext4: support uncached buffered I/OTaotao Chen
Set FOP_DONTCACHE in ext4_file_operations to declare support for uncached buffered I/O. To handle this flag, update ext4_write_begin() and ext4_da_write_begin() to use write_begin_get_folio(), which encapsulates FGP_DONTCACHE logic based on iocb->ki_flags. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-6-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16fs: change write_begin/write_end interface to take struct kiocb *Taotao Chen
Change the address_space_operations callbacks write_begin() and write_end() to take struct kiocb * as the first argument instead of struct file *. Update all affected function prototypes, implementations, call sites, and related documentation across VFS, filesystems, and block layer. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16eventpoll: Fix semi-unbounded recursionJann Horn
Ensure that epoll instances can never form a graph deeper than EP_MAX_NESTS+1 links. Currently, ep_loop_check_proc() ensures that the graph is loop-free and does some recursion depth checks, but those recursion depth checks don't limit the depth of the resulting tree for two reasons: - They don't look upwards in the tree. - If there are multiple downwards paths of different lengths, only one of the paths is actually considered for the depth check since commit 28d82dc1c4ed ("epoll: limit paths"). Essentially, the current recursion depth check in ep_loop_check_proc() just serves to prevent it from recursing too deeply while checking for loops. A more thorough check is done in reverse_path_check() after the new graph edge has already been created; this checks, among other things, that no paths going upwards from any non-epoll file with a length of more than 5 edges exist. However, this check does not apply to non-epoll files. As a result, it is possible to recurse to a depth of at least roughly 500, tested on v6.15. (I am unsure if deeper recursion is possible; and this may have changed with commit 8c44dac8add7 ("eventpoll: Fix priority inversion problem").) To fix it: 1. In ep_loop_check_proc(), note the subtree depth of each visited node, and use subtree depths for the total depth calculation even when a subtree has already been visited. 2. Add ep_get_upwards_depth_proc() for similarly determining the maximum depth of an upwards walk. 3. In ep_loop_check(), use these values to limit the total path length between epoll nodes to EP_MAX_NESTS edges. Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/20250711-epoll-recursion-fix-v1-1-fb2457c33292@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16fs: tighten a sanity check in file_attr_to_fileattr()Dan Carpenter
The fattr->fa_xflags is a u64 that comes from the user. This is a sanity check to ensure that the users are only setting allowed flags. The problem is that it doesn't check the upper 32 bits. It doesn't really affect anything but for more flexibility in the future, we want to enforce users zero out those bits. Fixes: be7efb2d20d6 ("fs: introduce file_getattr and file_setattr syscalls") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/baf7b808-bcf2-4ac1-9313-882c91cc87b2@sabinyo.mountain Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-15fs: add a new remove_bdev() callbackQu Wenruo
Currently all filesystems which implement super_operations::shutdown() can not afford losing a device. Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the involved filesystem. But it will no longer be the case, as multi-device filesystems like btrfs and bcachefs can handle certain device loss without the need to shutdown the whole filesystem. To allow those multi-device filesystems to be integrated to use fs_holder_ops: - Add a new super_operations::remove_bdev() callback - Try ->remove_bdev() callback first inside fs_bdev_mark_dead() If the callback returned 0, meaning the fs can handling the device loss, then exit without doing anything else. If there is no such callback or the callback returned non-zero value, continue to shutdown the filesystem as usual. This means the new remove_bdev() should only do the check on whether the operation can continue, and if so do the fs specific handlings. The shutdown handling should still be handled by the existing ->shutdown() callback. For all existing filesystems with shutdown callback, there is no change to the code nor behavior. Btrfs is going to implement both the ->remove_bdev() and ->shutdown() callbacks soon. Signed-off-by: Qu Wenruo <wqu@suse.com> Link: https://lore.kernel.org/09909fcff7f2763cc037fec97ac2482bdc0a12cb.1752470276.git.wqu@suse.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-15gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aopsAndrew Price
Clears up the warning added in 7ee3647243e5 ("migrate: Remove call to ->writepage") that occurs in various xfstests, causing "something found in dmesg" failures. [ 341.136573] gfs2_meta_aops does not implement migrate_folio [ 341.136953] WARNING: CPU: 1 PID: 36 at mm/migrate.c:944 move_to_new_folio+0x2f8/0x300 Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-14binfmt_elf: Warn on missing or suspicious regset note namesDave Martin
Now that all regset definitions declare an explicit note name, warn if the note name is missing when generating a core dump. Simplify the fallback to always guess "LINUX", which is appropriate for all Linux-specific notes (i.e., all newly added notes, for a long time now). The one standard exception (PR_FPREG) will no longer have an "unexpected" note name overridden, but a warning will still be emitted. Also warn if the specified note name doesn't match the legacy pattern -- but don't bother to override the name in this case. This warning can be removed in future if new note types emerge that require a specific note name that is not "LINUX". No functional change, beyond the extra noise in dmesg and not overriding an unexpected note name for PR_FPREG any more. Now that all upstream arches are ported to use USER_REGSET_NOTE_TYPE(), new regsets created by copy-pasting existing code should end up correct by construction. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Link: https://lore.kernel.org/r/20250701135616.29630-24-Dave.Martin@arm.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-14binfmt_elf: Dump non-arch notes with strictly matching name and typeDave Martin
The note names for some arch-independent coredump notes are specified manually, albeit by referring to the NN_<foo> #define corresponding to the NT_<foo> #define that specifies the note type. Now that there are no exceptional cases, refactor fill_note() to pick the correct NN_ and NT_ macros implcitly for the requested note type. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Link: https://lore.kernel.org/r/20250701135616.29630-4-Dave.Martin@arm.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-14regset: Add explicit core note name in struct user_regsetDave Martin
There is currently hard-coded logic spread around the tree for determining the note name for regset notes emitted in coredumps. Now that the names are declared explicitly in <uapi/elf.h>, this can be simplified. In preparation for getting rid of the special-case logic, add an explicit core_note_name field in struct user_regset for specifying the note name explicitly. To help avoid mistakes, a convenience macro USER_REGSET_NOTE_TYPE() is provided to set .core_note_type and .core_note_name based on the note type. When dumping core, use the new field to set the note name, if the regset specifies it. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Akihiko Odaki <akihiko.odaki@daynix.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390 Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Link: https://lore.kernel.org/r/20250701135616.29630-3-Dave.Martin@arm.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-14ext4: limit the maximum folio orderZhang Yi
In environments with a page size of 64KB, the maximum size of a folio can reach up to 128MB. Consequently, during the write-back of folios, the 'rsv_blocks' will be overestimated to 1,577, which can make pressure on the journal space where the journal is small. This can easily exceed the limit of a single transaction. Besides, an excessively large folio is meaningless and will instead increase the overhead of traversing the bhs within the folio. Therefore, limit the maximum order of a folio to 2048 filesystem blocks. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Joseph Qi <jiangqi903@gmail.com> Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-15gfs2: a minor finish_xmote cleanupAndreas Gruenbacher
As a minor clean-up to commit 1fc05c8d8426 ("gfs2: cancel timed-out glock requests"), when a demote request is in progress in finish_xmote(), there is no point in waking up the glock holder at the head of the queue because the reply from dlm cannot be on behalf of that glock holder. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15gfs2: simplify finish_xmoteAndreas Gruenbacher
As a follow-up to commit a431d49243a0 ("gfs2: Fix request cancelation bug"), it turns out that any call to finish_xmote() is always followed by a call to run_queue(), either * directly when glock_work_func() calls finish_xmote() before calling run_queue(), or * indirectly when do_xmote() calls finish_xmote() before calling gfs2_glock_queue_work(), which queues a call to glock_work_func() in work queue context, so remove the code in finish_xmote() that duplicates the functionality of run_queue(). In addition, the code this commit removes is missing a check for the GLF_DEMOTE flag which indicates that no further promotes should be performed, so if that code didn't get removed, that check would have to be added. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-15gfs2: sanitize the gdlm_ast -> finish_xmote interfaceAndreas Gruenbacher
When gdlm_ast() is called with a non-zero status code, this means that the requested operation did not succeed and the current lock state didn't change. Turn that into a non-zero LM_OUT_* status code (with ret & ~LM_OUT_ST_MASK != 0) instead of pretending that dlm returned the current lock state. That way, we can easily change finish_xmote() to only update gl->gl_state when the state has actually changed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com>
2025-07-14pNFS: Fix disk addr range check in block/scsi layoutSergey Bashirov
At the end of the isect translation, disc_addr represents the physical disk offset. Thus, end calculated from disk_addr is also a physical disk offset. Therefore, range checking should be done using map->disk_offset, not map->start. Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250702133226.212537-1-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14pNFS: Fix stripe mapping in block/scsi layoutSergey Bashirov
Because of integer division, we need to carefully calculate the disk offset. Consider the example below for a stripe of 6 volumes, a chunk size of 4096, and an offset of 70000. chunk = div_u64(offset, dev->chunk_size) = 70000 / 4096 = 17 offset = chunk * dev->chunk_size = 17 * 4096 = 69632 disk_offset_wrong = div_u64(offset, dev->nr_children) = 69632 / 6 = 11605 disk_chunk = div_u64(chunk, dev->nr_children) = 17 / 6 = 2 disk_offset = disk_chunk * dev->chunk_size = 2 * 4096 = 8192 Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250701122341.199112-1-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14pNFS: Handle RPC size limit for layoutcommitsSergey Bashirov
When there are too many block extents for a layoutcommit, they may not all fit into the maximum-sized RPC. This patch allows the generic pnfs code to properly handle -ENOSPC returned by the block/scsi layout driver and trigger additional layoutcommits if necessary. Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630183537.196479-5-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14pNFS: Add prepare commit trace to block/scsi layoutSergey Bashirov
Replace dprintk with trace event in ext_tree_prepare_commit() function. Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630183537.196479-4-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14pNFS: Fix extent encoding in block/scsi layoutSergey Bashirov
The ext_tree_encode_commit() function may be called multiple times for the same file, layout, and last written byte if the provided buffer is not large enough to encode all extents in it. The first problem is that the last written byte field must be zeroed only on a successful call, otherwise we will lose its actual value and get an integer overflow on the next encoding attempt. The second problem is that we can't count and encode in one pass. The extent state changes during encoding, so if we return -ENOSPC but have already encoded some extents into a small buffer, they will not be re-encoded into a new larger buffer on the next try. As a result, the client never commits these extents to the server. Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630183537.196479-3-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14pNFS: Fix uninited ptr deref in block/scsi layoutSergey Bashirov
The error occurs on the third attempt to encode extents. When function ext_tree_prepare_commit() reallocates a larger buffer to retry encoding extents, the "layoutupdate_pages" page array is initialized only after the retry loop. But ext_tree_free_commitdata() is called on every iteration and tries to put pages in the array, thus dereferencing uninitialized pointers. An additional problem is that there is no limit on the maximum possible buffer_size. When there are too many extents, the client may create a layoutcommit that is larger than the maximum possible RPC size accepted by the server. During testing, we observed two typical scenarios. First, one memory page for extents is enough when we work with small files, append data to the end of the file, or preallocate extents before writing. But when we fill a new large file without preallocating, the number of extents can be huge, and counting the number of written extents in ext_tree_encode_commit() does not help much. Since this number increases even more between unlocking and locking of ext_tree, the reallocated buffer may not be large enough again and again. Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com> Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630183537.196479-2-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14NFS: Remove unused function nfs_umountDr. David Alan Gilbert
nfs_umount() has been unused since 2013's commit 4580a92d44e2 ("NFS: Use server-recommended security flavor by default (NFSv3)") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20250218215250.263709-1-linux@treblig.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: create a kernel keyringChristoph Hellwig
Create a kernel .nfs keyring similar to the nvme .nvme one. Unlike for a userspace-created keyrind, tlshd is a possesor of the keys with this and thus the keys don't need user read permissions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20250515115107.33052-3-hch@lst.de Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14NFS: support the kernel keyring for TLSChristoph Hellwig
Allow tlshd to use a per-mount key from the kernel keyring similar to NVMe over TCP. Note that tlshd expects keys and certificates stored in the kernel keyring to be in DER format, not the PEM format used for file based keys and certificates, so they need to be converted before they are added to the keyring, which is a bit unexpected. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20250515115107.33052-2-hch@lst.de Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14NFS: Allow folio migration for the case of mode == MIGRATE_SYNCTrond Myklebust
When the mode is MIGRATE_SYNC, we are allowed to call nfs_wb_folio() under the folio lock. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: new tracepoint in match_stateid operationJeff Layton
Add new tracepoints in the NFSv4 match_stateid minorversion op that show the info in both stateids. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-4-540c9fb48da2@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: new tracepoint in nfs_delegation_need_returnJeff Layton
Add a tracepoint in the function that decides whether to return a delegation to the server. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-3-540c9fb48da2@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: add a tracepoint to nfs_inode_detach_delegation_lockedJeff Layton
We have tracepoints for setting a delegation and reclaiming them. Add a tracepoint for when the delegation is being detached from the inode. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-2-540c9fb48da2@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: add cache_validity to the nfs_inode_event tracepointsJeff Layton
Managing the cache_validity flags is the deep voodoo of NFS cache coherency. Let's have a little extra visibility into that value via the nfs_inode_event tracepoints. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-1-540c9fb48da2@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14NFS: remove unused time_delta field from struct nfs_serverAnthony Iliopoulos
The last code that was using this was removed via commit ca0daa277aca ("NFS: Cache aggressively when file is open for writing") which was merged in v4.8-rc1, so it can be removed completely. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Link: https://lore.kernel.org/r/20250613094439.82338-3-ailiop@suse.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14NFS: remove unused wpages field from struct nfs_serverAnthony Iliopoulos
The wpages field is not serving any purpose since commit c63c7b051395 ("NFS: Fix a race when doing NFS write coalescing") which was merged in v2.6.22-rc1. Remove it completely. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Link: https://lore.kernel.org/r/20250613094439.82338-2-ailiop@suse.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14pnfs: add pnfs_ds_connect trace pointTigran Mkrtchyan
This tracepoint aims to expose pnfs DS connect status Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/20250610151246.9147-1-tigran.mkrtchyan@desy.de Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: use lock_two_nondirectories()NeilBrown
Rather than open-coding this function call it to make intention clear and to use "correct" nesting levels (parent and child are for directories). This is purely cosmetic with no expected change in behaviour. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/174942091741.608730.3327223511347232829@noble.neil.brown.name Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14NFS: Return the file btime in the statx results when appropriateTrond Myklebust
If the server supports the NFSv4.x "create_time" attribute, then return it as part of the statx results. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/eae27d6467e08aaa67e0ac6ae7119263a0f83349.1748515333.git.bcodding@redhat.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14nfs: Add timecreate to nfs inodeAnne Marie Merritt
Add tracking of the create time (a.k.a. btime) along with corresponding bitfields, request, and decode xdr routines. Signed-off-by: Anne Marie Merritt <annemarie.merritt@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/1e3677b0655fa2bbaba0817b41d111d94a06e5ee.1748515333.git.bcodding@redhat.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14Expand the type of nfs_fattr->validTrond Myklebust
We need to be able to track more than 32 attributes per inode. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/1e3405fca54efd0be7c91c1da77917b94f5dfcc4.1748515333.git.bcodding@redhat.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-07-14jfs: stop using write_cache_pagesChristoph Hellwig
Stop using the obsolete write_cache_pages and use writeback_iter directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2025-07-14jfs: truncate good inode pages when hard link is 0Lizhi Xu
The fileset value of the inode copy from the disk by the reproducer is AGGR_RESERVED_I. When executing evict, its hard link number is 0, so its inode pages are not truncated. This causes the bugon to be triggered when executing clear_inode() because nrpages is greater than 0. Reported-by: syzbot+6e516bb515d93230bc7b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6e516bb515d93230bc7b Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2025-07-14jfs: jfs_xtree: replace XT_GETPAGE macro with xt_getpage()Suchit Karunakaran
Replace legacy XT_GETPAGE macro with an inline function that returns a xtpage_t pointer and update all instances of XT_GETPAGE in jfs_xtree.c Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Simplified xt_getpage by removing size and rc arguments. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2025-07-14jfs: Regular file corruption checkEdward Adam Davis
The reproducer builds a corrupted file on disk with a negative i_size value. Add a check when opening this file to avoid subsequent operation failures. Reported-by: syzbot+630f6d40b3ccabc8e96e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=630f6d40b3ccabc8e96e Tested-by: syzbot+630f6d40b3ccabc8e96e@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2025-07-14jfs: upper bound check of tree index in dbAllocAGArnaud Lecomte
When computing the tree index in dbAllocAG, we never check if we are out of bounds realative to the size of the stree. This could happen in a scenario where the filesystem metadata are corrupted. Reported-by: syzbot+cffd18309153948f3c3e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cffd18309153948f3c3e Tested-by: syzbot+cffd18309153948f3c3e@syzkaller.appspotmail.com Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2025-07-14fsverity: Switch from crypto_shash to SHA-2 libraryEric Biggers
fsverity supports two hash algorithms: SHA-256 and SHA-512. Since both of these have a library API now, just use the library API instead of crypto_shash. Even with multiple algorithms, the library-based code still ends up being quite a bit simpler, due to how clumsy the old-school crypto API is. The library-based code is also more efficient, since it avoids overheads such as indirect calls. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250630172224.46909-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-14fsverity: Explicitly include <linux/export.h>Eric Biggers
Fix build warnings with W=1 that started appearing after commit a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1"). Link: https://lore.kernel.org/r/20250614221723.131827-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>