path: root/fs
2025-06-29  don't have mounts pin their parents  (Al Viro)
Simplify the rules for mount refcounts. Current rules include:
* being a namespace root => +1
* being someone's child => +1
* being someone's child => +1 to parent's refcount, unless you've already been through umount_tree().
The last part is not needed at all. It makes for more places where we need to decrement refcounts, and it creates an asymmetry between the situations for something that has never been a part of a namespace and something that left one, both for no good reason. If a mount's refcount has additions from its children, we know that
* it's someone's child itself (and will remain so until umount_tree(), at which point contributions from children will disappear), or
* it is the root of a namespace (and will remain such until it either becomes someone's child in another namespace or goes through umount_tree()), or
* it is the root of some tree copy, and is currently pinned by the caller of copy_tree() (and remains such until it either gets into a namespace, or goes to umount_tree()).
In all cases we already have contribution(s) to the refcount that will last as long as the contribution from children remains. In other words, the lifetime is not affected by refcount contributions from children. They might be useful for "is it busy" checks, but those are actually no harder to express without them.
NB: the propagate_mnt_busy() part is an equivalent transformation, ugly as it is; the current logic is actually wrong and may give false negatives, but fixing that is for a separate patch (probably earlier in the queue).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  get rid of mountpoint->m_count  (Al Viro)
struct mountpoint has an odd kinda-sorta refcount in it. It's always either equal to or one above the number of mounts attached to that mountpoint. "One above" happens when a function takes a temporary reference to the mountpoint. Things get simpler if we express that as inserting a local object into ->m_list and removing it to drop the reference. New calling conventions:
1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint() take an extra struct pinned_mountpoint * argument and return 0/-E... (or true/false in the case of lookup_mountpoint()) instead of returning struct mountpoint pointers. In case of success, the struct mountpoint * we used to get can be found as pinned_mountpoint.mp
2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes the address of the struct pinned_mountpoint - the same one that had been passed to lock_mount()/do_lock_mount().
3) put_mountpoint() for a temporary reference (paired with get_mountpoint() or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes the address of the pinned_mountpoint we passed to the matching {get,lookup}_mountpoint().
4) all instances of pinned_mountpoint are local variables; they always live on the stack. {} is used for the initializer; after a successful {get,lookup}_mountpoint() we must make sure to call unpin_mountpoint() before leaving the scope, and after a successful {do_,}lock_mount() we must make sure to call unlock_mount() before leaving the scope.
5) all manipulations of ->m_count are gone, along with ->m_count itself. struct mountpoint lives while its ->m_list is non-empty.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
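[Editor's sketch] A minimal illustration of the calling convention described above, assuming only what the notes state (the 0/-E... return, the on-stack {} initializer, and the .mp member); frob_mountpoint() and do_something_with() are hypothetical placeholders, not functions from the patch:

```c
/* Sketch only - illustrates the pinned_mountpoint convention described
 * above; do_something_with() is a hypothetical consumer. */
static int frob_mountpoint(struct path *path)
{
	struct pinned_mountpoint pinned = {};	/* always a local, lives on stack */
	int err = lock_mount(path, &pinned);	/* returns 0/-E... now */

	if (err)
		return err;
	do_something_with(pinned.mp);	/* the struct mountpoint * we used to get */
	unlock_mount(&pinned);		/* same address we passed to lock_mount() */
	return 0;
}
```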
2025-06-29  combine __put_mountpoint() with unhash_mnt()  (Al Viro)
A call of unhash_mnt() is immediately followed by passing its return value to __put_mountpoint(); the shrink list given to __put_mountpoint() will be ex_mountpoints when called from umount_mnt() and 'list' when called from mntput_no_expire(). Replace that pair with __umount_mnt(mount, shrink_list), moving the call of __put_mountpoint() into it (and returning nothing), and adjust the callers.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint()  (Al Viro)
Attach new_mnt *before* detaching root_mnt; that way we don't need to keep hold of the mountpoint, and one more pair of unhash_mnt()/put_mountpoint() gets folded together into umount_mnt().
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  take ->mnt_expire handling under mount_lock [read_seqlock_excl]  (Al Viro)
Doesn't take much massage, and we no longer need to make sure that by the time of the final mntput() the victim has been removed from the list. Makes life safer for ->d_automount() instances... Rules:
* all ->mnt_expire accesses are under mount_lock.
* insertion into the list is done by mnt_set_expiry(), and the caller (a ->d_automount() instance) must hold a reference to the mount in question. It shouldn't be done more than once for a mount.
* if a mount on an expiry list is not yet mounted, it will be ignored by anything that walks that list.
* if the final mntput() finds its victim still on an expiry list (in which case it must never have been mounted - umount_tree() would've taken it out), it will remove the victim from the list.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): remove from expiry list on move  (Al Viro)
... rather than doing that in do_move_mount(). That's the main obstacle to moving the protection of ->mnt_expire from namespace_sem to mount_lock (spinlock-only), which would simplify several failure exits. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_move_mount(): get rid of 'attached' flag  (Al Viro)
'attached' serves as a proxy for "source is a subtree of our namespace and not the entirety of anon namespace"; finish massaging it away. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_move_mount(): take dropping the old mountpoint into attach_recursive_mnt()  (Al Viro)
... and fold it with unhash_mnt() there - there's no need to retain a reference to old_mp beyond that point, since by then all mountpoints we were going to add are either explicitly pinned by get_mountpoint() or have stuff already added to them. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): get rid of flags entirely  (Al Viro)
move vs. attach is trivially detected as mnt_has_parent(source_mnt)... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): pass destination mount in all cases  (Al Viro)
... and 'beneath' is no longer used there Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): unify the mnt_change_mountpoint() logics  (Al Viro)
The logic used for tucking under an existing mount differs for the original and the copies; copies do a mount hash lookup to see if the mountpoint-to-be is already overmounted, while the original is told explicitly. But the same logic that is used for copies works for the original as well, at which point we get very close to eliminating the need to pass the 'beneath' flag to attach_recursive_mnt().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  make commit_tree() usable in same-namespace move case  (Al Viro)
Once attach_recursive_mnt() has created all copies of the original subtree, it needs to put them in place(s). The steps needed for those are slightly different:
1) in the 'move' case, the original copy doesn't need any rbtree manipulations (everything's already in the same namespace where it will be), but it needs to be detached from its current location
2) in the 'attach' case, the original may be in an anon namespace; if it is, all those mounts need to be removed from their current namespace before insertion into the target one
3) additional copies have a couple of extra twists - in case of cross-userns propagation we need to lock everything other than the root of the subtree, and in case we end up inserting under an existing mount, that mount needs to be found (for the original copy we have it explicitly passed by the caller).
Quite a bit of that can be unified; as the first step, make the commit_tree() helper (inserting mounts into a namespace, hashing the root of the subtree and marking the namespace as updated) usable in all cases; (2) and (3) are already using it and for (1) we only need to make the insertion of mounts into the namespace conditional.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  Rewrite of propagate_umount()  (Al Viro)
The variant currently in the tree has problems; trying to prove correctness has caught at least one class of bugs (reparenting that ends up moving the visible location of the reparented mount, due to not excluding some of the counterparts on propagation that should've been included). I tried to prove that it's the only bug there; I'm still not sure whether it is. If anyone can reconstruct and write down an analysis of the mainline implementation, I'll gladly review it; as it is, I ended up doing a different implementation. The candidate collection phase is similar, but trimming the set down until it satisfies the constraints turned out pretty different. I hoped to do the transformation as a massage series, but that turns out to be too convoluted. So it's a single patch replacing propagate_umount() and friends in one go, with notes and analysis in D/f/propagate_umount.txt (in addition to inline comments). As far as I can tell, it is provably correct and provably linear in the number of mounts we need to look at in order to decide what should be unmounted. It even builds and seems to survive testing... Another nice thing that fell out of that is that ->mnt_umounting is no longer needed. Compared to the first version:
* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
* trim_ancestors() only clears that flag, leaving the suckers on the list
* trim_one() and handle_locked() take the stuff with the flag cleared off the list. That allows iterating with list_for_each_entry_safe() when calling trim_one() - it removes at most one element from the list now.
* no globals - I didn't bother with any kind of context, not worth it.
* Notes updated accordingly; I have not touched the terms yet.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  sanitize handling of long-term internal mounts  (Al Viro)
The original rationale for those had been the reduced cost of mntput() for stuff that is mounted somewhere. Mount refcount increments and decrements are frequent; what's worse, they tend to concentrate on the same instances and cacheline pingpong is quite noticeable. As the result, mount refcounts are per-cpu; that allows a very cheap increment. A plain decrement would be just as easy, but decrement-and-test is anything but (we need to add the components up, with exclusion against possible increment-from-zero, etc.). Fortunately, there is a very common case where we can tell that a decrement won't be the final one - if the thing we are dropping is currently mounted somewhere. We have an RCU delay between the removal from the mount tree and dropping the reference that used to pin it there, so we can just take rcu_read_lock() and check if the victim is mounted somewhere. If it is, we can go ahead and decrement without any further checks - the reference we are dropping is not the last one. If it isn't, we get all the fun with locking, carefully adding up components, etc., but the majority of refcount decrements end up taking the fast path. There is a major exception, though - pipes and sockets. Those live on internal filesystems that are not going to be mounted anywhere. They are not going to be _un_mounted, of course, so having to take the slow path every time a pipe or socket gets closed is really obnoxious. The solution had been to mark them as long-lived ones - essentially faking the "mounted somewhere" indicator. With a minor modification that works even for ones that do eventually get dropped - all it takes is making sure we have an RCU delay between clearing the "mounted somewhere" indicator and dropping the reference. There are some additional twists (if you want to drop a dozen such internal mounts, you'd be better off clearing the indicator on all of them, doing an RCU delay once, then dropping the references), but in the basic form it had been
* use kern_mount() if you want your internal mount to be a long-term one.
* use kern_unmount() to undo that.
Unfortunately, things did rot a bit during the mount API reshuffling. In several cases we have lost the "fake the indicator" part; kern_unmount() on the unmount side remained (it doesn't warn if you use it on a mount without the indicator), but all benefits regarding mntput() cost had been lost. To get rid of that bitrot, let's add a new helper that works with the fs_context-based API: fc_mount_longterm(). It's a counterpart of fc_mount() that does, on success, mark its result as long-term. It must be paired with kern_unmount() or equivalents. Converted:
1) mqueue (it used to use kern_mount_data() and the umount side is still as it used to be)
2) hugetlbfs (used to use kern_mount_data(), internal mount is never unmounted in this one)
3) i915 gemfs (used to be kern_mount() + manual remount to set options, still uses kern_unmount() on the umount side)
4) v3d gemfs (copied from i915)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
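[Editor's sketch] The resulting usage pattern, under stated assumptions: my_fs_type, my_internal_mnt and the init/exit pairing are illustrative placeholders, not taken from any of the converted filesystems; only fc_mount_longterm() and its pairing with kern_unmount() come from the commit itself:

```c
/* Sketch: a long-term internal mount via the fs_context API.
 * 'my_fs_type' is assumed to be a file_system_type defined elsewhere. */
static struct vfsmount *my_internal_mnt;

static int __init my_fs_init(void)
{
	struct fs_context *fc = fs_context_for_mount(&my_fs_type, 0);

	if (IS_ERR(fc))
		return PTR_ERR(fc);
	my_internal_mnt = fc_mount_longterm(fc);	/* marks the mount long-term */
	put_fs_context(fc);
	return PTR_ERR_OR_ZERO(my_internal_mnt);
}

static void __exit my_fs_exit(void)
{
	kern_unmount(my_internal_mnt);	/* must pair with fc_mount_longterm() */
}
```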
2025-06-29  do_umount(): simplify the "is it still mounted" checks  (Al Viro)
Calls of do_umount() are always preceded by can_umount(), where we'd done a racy check for the mount belonging to our namespace; if it didn't, can_umount() would've failed with -EINVAL and we wouldn't have reached do_umount() at all. That check needs to be redone once we have acquired namespace_sem, and in do_umount() we do that. However, it's done in a very odd way; we check that the mount is still in the rbtree of _some_ namespace or that its mnt_list is non-empty. It is equivalent to check_mnt(mnt) - we know that earlier the mount was mounted in our namespace; if it has stayed there, it's going to remain in the rbtree of our namespace. OTOH, if it had ever been removed from our namespace, it would have been removed from the rbtree and would never have been re-added to a namespace afterwards. As for ->mnt_list, for something that had been mounted in a namespace we'll never observe a non-empty ->mnt_list while holding namespace_sem - it does temporarily become non-empty during umount_tree(), but that doesn't outlast the call of umount_tree(), let alone dropping namespace_sem. Things get much easier to follow if we replace that with the (equivalent) check_mnt(mnt). What's more, currently we treat a failure of that test as "quietly do nothing"; we might as well pretend that we'd lost the race and fail the same way can_umount() would have.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  clone_mnt(): simplify the propagation-related logics  (Al Viro)
The underlying rules are simple:
* MNT_SHARED should be set iff ->mnt_group_id of the new mount ends up non-zero.
* mounts should be on the same ->mnt_share cyclic list iff they have the same non-zero ->mnt_group_id value.
* CL_PRIVATE is mutually exclusive with MNT_SHARED, MNT_SLAVE, MNT_SHARED_TO_SLAVE and MNT_EXPIRE; the whole point of that thing is to get a clone of the old mount that would *not* be on any namespace-related lists.
The above allows making the logic more straightforward; what's more, it makes the proof that the invariants are maintained much simpler. The variant in mainline is safe (aside from a very narrow race with unsafe modification of mnt_flags right after we had the mount exposed in the superblock's ->s_mounts; theoretically it can race with an ro remount of the original, but it's not easy to hit), but the proof of its correctness is really unpleasant.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  don't set MNT_LOCKED on parentless mounts  (Al Viro)
Originally MNT_LOCKED meant only one thing - "don't let this mount be peeled off its parent, we don't want to have its mountpoint exposed". Accordingly, it had only been set on mounts that *do* have a parent. Later it got overloaded with another use - setting it on the absolute root had given free protection against umount(2) of the absolute root (which was possible to trigger, and oopsed). Not a bad trick, but it ended up costing more than it bought us. Unfortunately, the cost included both hard-to-reason-about logic and a subtle race between mount -o remount,ro and mount --[r]bind - the lockless &= ~MNT_LOCKED at the end of __do_loopback() could race with sb_prepare_remount_readonly() setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should be). The race wouldn't be much of a problem (there are other ways to deal with it), but the subtlety is. Turns out that nobody except umount(2) had ever made use of having MNT_LOCKED set on the absolute root. So let's give up on that trick, clever as it had been, add an explicit check in do_umount() and return to using MNT_LOCKED only for mounts that have a parent. It means that
* clone_mnt() no longer copies MNT_LOCKED
* copy_tree() sets it on submounts if their counterparts had been marked such, and does that right next to attach_mnt() in there, in the same mount_lock scope.
* __do_loopback() no longer needs to strip MNT_LOCKED off the root of the subtree it's about to return; no store, no race.
* init_mount_tree() doesn't bother setting MNT_LOCKED on the absolute root.
* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root; accordingly, its caller (the loop in attach_recursive_mnt()) does not need to bother stripping that MNT_LOCKED on the root.
Note that lock_mnt_tree() setting MNT_LOCKED on submounts happens in the same mount_lock scope as the __attach_mnt() (from commit_tree()) that makes them reachable.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  __attach_mnt(): lose the second argument  (Al Viro)
It's always ->mnt_parent of the first one. What the function does is making a mount (with already set parent and mountpoint) visible - in mount hash and in the parent's list of children. IOW, it takes the existing rootwards linkage and sets the matching crownwards linkage. Renamed to make_visible(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  dissolve_on_fput(): use anon_ns_root()  (Al Viro)
that's the condition we are actually trying to check there... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  new predicate: anon_ns_root(mount)  (Al Viro)
checks if a mount is the root of an anonymous namespace. Switch open-coded equivalents to using it. For mounts that belong to an anon namespace, !mnt_has_parent(mount) is the same as mount == ns->root, and the intent is more obvious in the latter form.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
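[Editor's sketch] A plausible shape for the predicate, combining the two checks named above; the actual helper may differ in details such as locking annotations:

```c
/* Sketch: is this mount the root of an anonymous namespace? */
static inline bool anon_ns_root(const struct mount *m)
{
	struct mnt_namespace *ns = m->mnt_ns;

	return ns && is_anon_ns(ns) && m == ns->root;
}
```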
2025-06-29  constify is_local_mountpoint()  (Al Viro)
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  new predicate: mount_is_ancestor()  (Al Viro)
mount_is_ancestor(p1, p2) returns true iff there is a possibly empty ancestry chain from p1 to p2. Convert the open-coded checks. Unlike those open-coded variants it does not depend upon p1 not being root... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
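[Editor's sketch] One way such a predicate can look, given the "possibly empty chain" semantics above (so mount_is_ancestor(p, p) is true) and the stated requirement that it must be safe when p1 is a root; the real implementation may differ:

```c
/* Sketch: walk p2's ancestry towards the root, looking for p1. */
static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
{
	while (p2 != p1 && mnt_has_parent(p2))
		p2 = p2->mnt_parent;
	return p2 == p1;	/* the empty chain (p1 == p2) counts too */
}
```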
2025-06-29  pnode: lift peers() into pnode.h  (Al Viro)
it's going to be useful both in pnode.c and namespace.c Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  constify mnt_has_parent()  (Al Viro)
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  copy_tree(): don't set ->mnt_mountpoint on the root of copy  (Al Viro)
It never made any sense - neither when copy_tree() had been introduced (2.4.11-pre5), nor at any point afterwards. Mountpoint is meaningless without parent mount and the root of copied tree has no parent until we get around to attaching it somewhere. At that time we'll have mountpoint set; before that we have no idea which dentry will be used as mountpoint. IOW, copy_tree() should just leave the default value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  prevent mount hash conflicts  (Al Viro)
Currently it's still possible to run into a pathological situation when two hashed mounts share both parent and mountpoint. That does not work well, for obvious reasons. We are not far from getting rid of that; the only remaining gap is attach_recursive_mnt() not being careful enough when sliding a tree under existing mount (for propagated copies or in 'beneath' case for the original one). To deal with that cleanly we need to be able to find overmounts (i.e. mounts on top of parent's root); we could do hash lookups or scan the list of children but either would be costly. Since one of the results we get from that will be prevention of multiple parallel overmounts, let's just bite the bullet and store a (non-counting) reference to overmount in struct mount. With that done, closing the hole in attach_recursive_mnt() becomes easy - we just need to follow the chain of overmounts before we change the mountpoint of the mount we are sliding things under. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  get rid of mnt_set_mountpoint_beneath()  (Al Viro)
mnt_set_mountpoint_beneath() consists of attaching the new mount side-by-side with the one we want to mount beneath (by mnt_set_mountpoint()), followed by mnt_change_mountpoint() shifting the top mount onto the new one. Both callers of mnt_set_mountpoint_beneath() (both in attach_recursive_mnt()) have the same form - in the 'beneath' case we call mnt_set_mountpoint_beneath(), otherwise - mnt_set_mountpoint(). The thing is, expressing that as an unconditional mnt_set_mountpoint(), followed, in the 'beneath' case, by mnt_change_mountpoint(), is just as easy. And these mnt_change_mountpoint() calls are similar to the ones we make when attaching propagated copies, which will allow more cleanups in the next commits.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_mnt(): expand in attach_recursive_mnt(), then lose the flag argument  (Al Viro)
Simpler that way - all but one caller pass false as the 'beneath' argument, and that one caller is actually happier with the call expanded - the logic for choosing the mountpoint is identical for the 'moving' and 'attaching' cases, and now that is no longer hidden.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  smb: client: fix readdir returning wrong type with POSIX extensions  (Philipp Kerling)
When SMB 3.1.1 POSIX Extensions are negotiated, userspace applications using readdir() or getdents() calls without stat() on each individual file (such as a simple "ls" or "find") would misidentify file types and exhibit strange behavior such as not descending into directories. The reason for this behavior is an oversight in the cifs_posix_to_fattr conversion function. Instead of extracting the entry type for cf_dtype from the properly converted cf_mode field, it tries to extract the type from the PDU. While the wire representation of the entry mode is similar in structure to POSIX stat(), the assignments of the entry types are different. Applying the S_DT macro to cf_mode instead yields the correct result. This is also what the equivalent function smb311_posix_info_to_fattr in inode.c already does for stat() etc., which is why "ls -l" would give the correct file type but "ls" would not (as identified by the colors).
Cc: stable@vger.kernel.org
Signed-off-by: Philipp Kerling <pkerling@casix.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
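[Editor's sketch] The heart of the fix, using the field names from the description above; wire_to_posix_mode() is a hypothetical stand-in for the real mode conversion, and surrounding code is elided:

```c
/* Sketch: cf_dtype must be derived from the already-converted cf_mode,
 * not from the wire-format mode in the PDU. */
fattr->cf_mode = wire_to_posix_mode(info);	/* hypothetical conversion step */
fattr->cf_dtype = S_DT(fattr->cf_mode);		/* S_DT() maps S_IFMT bits to DT_* */
```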
2025-06-29  bcachefs: fix btree_trans_peek_prev_journal()  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-27  Merge tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6  (Linus Torvalds)
Pull smb client fixes from Steve French:
- Multichannel reconnect lock ordering deadlock fix
- Fix for regression in handling native Windows symlinks
- Three smbdirect fixes:
  - oops in RDMA response processing
  - smbdirect memcpy issue
  - fix smbdirect regression with large writes (smbdirect test cases now all passing)
- Fix for "FAILED_TO_PARSE" warning in trace-cmd report output
* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
  cifs: Fix the smbd_response slab to allow usercopy
  smb: client: fix potential deadlock when reconnecting channels
  smb: client: remove \t from TP_printk statements
  smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
  smb: client: fix regression with native SMB symlinks
2025-06-27  Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull misc fixes from Andrew Morton: "16 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 5 are for MM"
* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
2025-06-27  btrfs: use btrfs_record_snapshot_destroy() during rmdir  (Filipe Manana)
We are setting the parent directory's last_unlink_trans directly, which may result in a concurrent task that is logging the directory not seeing the update, and therefore it can log the directory after we removed a child directory which had a snapshot within, instead of falling back to a transaction commit. Replaying such a log tree would result in a mount failure, since we can't currently delete snapshots (and subvolumes) during log replay. This is the type of failure described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). Fix this by using btrfs_record_snapshot_destroy(), which updates the last_unlink_trans field while holding the inode's log_mutex lock.
Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27  btrfs: propagate last_unlink_trans earlier when doing a rmdir  (Filipe Manana)
In case the removed directory had a snapshot that was deleted, we are propagating its inode's last_unlink_trans to the parent directory after we removed the entry from the parent directory. This leaves a small race window where someone can log the parent directory after we removed the entry and before we updated last_unlink_trans, and as a result, if we ever try to replay such a log tree, we will fail, since we will attempt to remove a snapshot during log replay, which is currently not possible and results in the log replay (and mount) failing. This is the type of failure described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). So fix this by propagating the last_unlink_trans to the parent directory before we remove the entry from it.
Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27  btrfs: record new subvolume in parent dir earlier to avoid dir logging races  (Filipe Manana)
Instead of recording that a new subvolume was created in a directory after we add the entry to the directory, record it before adding the entry. This is to avoid races where, after creating the entry and before recording the new subvolume in the directory (the call to btrfs_record_new_subvolume()), another task logs the directory, so we end up with a log tree where we logged a directory that has an entry pointing to a root that was not yet committed, resulting in an invalid entry if the log is persisted and replayed later due to a power failure or crash. Also state this requirement in the function comment for btrfs_record_new_subvolume(), similar to what we do for btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().
Fixes: 45c4102f0d82 ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27  btrfs: fix inode lookup error handling during log replay  (Filipe Manana)
When replaying log trees we use read_one_inode() to get an inode, which is just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for btrfs_iget(). But read_one_inode() always returns NULL for any error that btrfs_iget_logging() / btrfs_iget() may return, and this is a problem because:
1) In many callers of read_one_inode() we convert the NULL into -EIO, which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT for example, besides -EIO and other errors. So during log replay we may end up reporting a false -EIO, which is confusing since we may not have had any IO error at all;
2) When replaying directory deletes, at replay_dir_deletes(), we assume the NULL returned from read_one_inode() means that the inode doesn't exist and then proceed as if no error had happened. This is wrong because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an actual error and the target inode may exist in the target subvolume root - this may later result in the log replay code failing at a later stage (if we are "lucky") or succeeding but leaving some inconsistency in the filesystem.
So fix this by not ignoring errors from btrfs_iget_logging() and, as a consequence, remove the read_one_inode() wrapper and just use btrfs_iget_logging() directly. Also, since btrfs_iget_logging() is supposed to be called only against subvolume roots, just like read_one_inode() (which had a comment about it), add an assertion to btrfs_iget_logging() to check that the target root corresponds to a subvolume root.
Fixes: 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
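[Editor's sketch] The resulting call-site pattern, inferred from the description above; variable naming is illustrative, and the only assumption is that btrfs_iget_logging() reports failure via ERR_PTR like btrfs_iget() does:

```c
/* Sketch: propagate the real error instead of collapsing it to -EIO. */
inode = btrfs_iget_logging(objectid, root);
if (IS_ERR(inode)) {
	ret = PTR_ERR(inode);	/* may be -ENOMEM, -ENOENT, -EIO, ... */
	goto out;
}
```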
2025-06-27  btrfs: fix iteration of extrefs during log replay  (Filipe Manana)
At __inode_add_ref(), when processing extrefs, if we jump to the 'next' label we have an undefined value of victim_name.len, since we haven't initialized it before the goto. This results in an invalid memory access in the next iteration of the loop, since victim_name.len was not initialized to the length of the name of the current extref. Fix this by initializing victim_name.len with the current extref's name length.
Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27  btrfs: fix missing error handling when searching for inode refs during log replay  (Filipe Manana)
During log replay, at __add_inode_ref(), when we are searching for inode ref keys we totally ignore whether btrfs_search_slot() returns an error. This may make a log replay succeed when there was an actual error and leave some metadata inconsistency in a subvolume tree. Fix this by checking if an error was returned from btrfs_search_slot() and, if so, returning it to the caller.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
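[Editor's sketch] The shape of the added check; key and path setup are elided and the exact arguments at the real call site may differ:

```c
/* Sketch: don't ignore btrfs_search_slot() failures during log replay. */
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
if (ret < 0)
	goto out;	/* previously the error was silently dropped */
```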
2025-06-27  btrfs: fix failure to rebuild free space tree using multiple transactions  (Filipe Manana)
If we are rebuilding a free space tree, while modifying the free space tree we may need to allocate a new metadata block group. If we end up using multiple transactions for the rebuild, when we call btrfs_end_transaction() we enter btrfs_create_pending_block_groups() which calls add_block_group_free_space() to add items to the free space tree for the block group. Then later during the free space tree rebuild, at btrfs_rebuild_free_space_tree(), we may find such new block groups and call populate_free_space_tree() for them, which fails with -EEXIST because there are already items in the free space tree. Then we abort the transaction with -EEXIST at btrfs_rebuild_free_space_tree(). Notice that we say "may find" the new block groups because a new block group may be inserted in the block groups rbtree, which is being iterated by the rebuild process, before or after the current node where the rebuild process is currently at. Syzbot recently reported such case which produces a trace like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 Modules linked in: CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 sp : ffff80009c4f7740 x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000 x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003 x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8 x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276 x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001 x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000 x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00 x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001 Call trace: btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P) btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 irq event stamp: 330 hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline] hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130 
hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511 softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 ---[ end trace 0000000000000000 ]--- Fix this by flagging new block groups which had their free space tree entries already added and then skip them in the rebuild process. Also, since the rebuild may be triggered when doing a remount, make sure that when we clear an existing free space tree that we clear such flag from every existing block group, otherwise we would skip those block groups during the rebuild. Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/ Fixes: 882af9f13e83 ("btrfs: handle free space tree rebuild in multiple transactions") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27  fanotify: sanitize handle_type values when reporting fid  (Amir Goldstein)
Unlike file_handle, the type and len of struct fanotify_fh are u8. Traditionally, filesystems return handle_type < 0xff, but there is no enforcement for that in vfs. Add a sanity check in fanotify to avoid truncating handle_type if its value is > 0xff.
Fixes: 7cdafe6cc4a6 ("exportfs: check for error return value from exportfs_encode_*()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250627104835.184495-1-amir73il@gmail.com
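[Editor's sketch] A sketch of such a sanity check; the -EOVERFLOW choice and the surrounding names (fh, dwords) are assumptions, with only the u8-typed field taken from the commit:

```c
/* Sketch: reject handle types that would be truncated by the u8 field. */
int type = exportfs_encode_fid(inode, fid, &dwords);

if (type < 0 || WARN_ON_ONCE(type > 0xff))
	return -EOVERFLOW;	/* don't let the cast to u8 silently truncate */
fh->type = type;
```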
2025-06-27  bcachefs: mark invalid_btree_id autofix  (Bharadwaj Raju)
Checking for invalid IDs was introduced in 9e7cfb35e266 ("bcachefs: Check for invalid btree IDs") to prevent an invalid shift later, but since 141526548052 ("bcachefs: Bad btree roots are now autofix") made btree_root_bkey_invalid autofix, the fsck_err_on call didn't do anything. We can mark this err type (invalid_btree_id) autofix as well, so it gets handled.
Reported-by: syzbot+029d1989099aa5ae3e89@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=029d1989099aa5ae3e89
Fixes: 141526548052 ("bcachefs: Bad btree roots are now autofix")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-27  xfs: fix unmount hang with unflushable inodes stuck in the AIL  (Dave Chinner)
Unmount of a shutdown filesystem can hang with stale inode cluster buffers in the AIL like so:
[95964.140623] Call Trace:
[95964.144641] __schedule+0x699/0xb70
[95964.154003] schedule+0x64/0xd0
[95964.156851] xfs_ail_push_all_sync+0x9b/0xf0
[95964.164816] xfs_unmount_flush_inodes+0x41/0x70
[95964.168698] xfs_unmountfs+0x7f/0x170
[95964.171846] xfs_fs_put_super+0x3b/0x90
[95964.175216] generic_shutdown_super+0x77/0x160
[95964.178060] kill_block_super+0x1b/0x40
[95964.180553] xfs_kill_sb+0x12/0x30
[95964.182796] deactivate_locked_super+0x38/0x100
[95964.185735] deactivate_super+0x41/0x50
[95964.188245] cleanup_mnt+0x9f/0x160
[95964.190519] __cleanup_mnt+0x12/0x20
[95964.192899] task_work_run+0x89/0xb0
[95964.195221] resume_user_mode_work+0x4f/0x60
[95964.197931] syscall_exit_to_user_mode+0x76/0xb0
[95964.201003] do_syscall_64+0x74/0x130
$ pstree -N mnt |grep umount
|-check-parallel---nsexec---run_test.sh---753---umount
It always seems to be generic/753 that triggers this, and repeating a quick group test run triggers it every 10-15 iterations. Hence it generally triggers once every 30-40 minutes of test time. Just running generic/753 by itself or concurrently with a limited group of tests doesn't reproduce this issue at all. Tracing on a hung system shows the AIL repeating every 50ms a log force followed by an attempt to push pinned, aborted inodes from the AIL (trimmed for brevity):
xfs_log_force: lsn 0x1c caller xfsaild+0x18e
xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd
xfs_log_force: lsn 0x1c caller xfs_log_force+0x77
xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
....
The inode log items are marked as aborted, which means that either:
a) a transaction commit has occurred, seen an error or shutdown, and called xfs_trans_free_items() to abort the items. This should happen before any pinning of log items occurs. or
b) a dirty transaction has been cancelled. This should also happen before any pinning of log items occurs. or
c) AIL insertion at journal IO completion is marked as aborted. In this case, the log item is pinned by the CIL until journal IO completes and hence needs to be unpinned. This is then done after the ->iop_committed() callback is run, so the pin count should be balanced correctly.
Yet none of these seemed to be occurring. Further tracing indicated this:
d) Shutdown during CIL pushing resulting in log item completion being called from checkpoint abort processing. Items are unpinned and released without serialisation against each other, journal IO completion or transaction commit completion.
In this case, we may still have a transaction commit in flight that holds a reference to an xfs_buf_log_item (BLI) after CIL insertion. e.g. a synchronous transaction will flush the CIL before the transaction is torn down. The concurrent CIL push then aborts the insertion and drops the commit/AIL reference to the BLI.
This can leave the transaction commit context with the last reference to the BLI, which is dropped here:
xfs_trans_free_items()
  ->iop_release
    xfs_buf_item_release
      xfs_buf_item_put
        if (XFS_LI_ABORTED)
          xfs_trans_ail_delete
        xfs_buf_item_relse()
Unlike the journal completion ->iop_unpin path, this path does not run stale buffer completion processing when it drops the last reference, hence leaving the stale inodes attached to the buffer sitting in the AIL. There are no other references to those inodes, so there is no other mechanism to remove them from the AIL. Hence unmount hangs. The buffer lock context for stale buffers is passed to the last BLI reference. This is normally the last BLI unpin on journal IO completion. The unpin then processes the stale buffer completion and releases the buffer lock. However, if the final unpin from journal IO completion (or CIL push abort) does not hold the last reference to the BLI, there -must- still be a transaction context that references the BLI, and so that context must perform the stale buffer completion processing before the buffer is unlocked and the BLI torn down. The fix for this is to rework the xfs_buf_item_relse() path to run stale buffer completion processing if it drops the last reference to the BLI. We still hold the buffer locked, so the buffer owner and lock context is the same as if we passed the BLI and buffer to the ->iop_unpin() context to finish stale processing on journal commit. However, we have to be careful here. In a shutdown state, we can be freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse() and xfs_trans_bdetach(). The existing code handles this case by considering shutdown state as "aborted", but in doing so largely masks the failure to clean up stale BLI state from the xfs_buf_item_relse() path. i.e. regardless of the shutdown state and whether the item is in the AIL, we must finish the stale buffer cleanup if we are dropping the last BLI reference from the ->iop_release path in transaction commit context.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27  xfs: factor out stale buffer item completion  (Dave Chinner)
The stale buffer item completion handling is currently only done from BLI unpinning. We need to perform this function from wherever the last reference to the BLI is dropped, so first we need to factor this code out into a helper.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27  xfs: rearrange code in xfs_buf_item.c  (Dave Chinner)
The code to initialise, release and free items is all the way down at the bottom of the file. Upcoming fixes need these functions earlier in the file, so move them to the top. There is one code change in this move - the parameter to xfs_buf_item_relse() is changed from the xfs_buf to the xfs_buf_log_item - the thing that the function is releasing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27  xfs: add tracepoints for stale pinned inode state debug  (Dave Chinner)
I needed more insight into how stale inodes were getting stuck on the AIL after a forced shutdown when running fsstress. These are the tracepoints I added for that purpose. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27  xfs: avoid dquot buffer pin deadlock  (Dave Chinner)
On shutdown when quotas are enabled, the shutdown can deadlock trying to unpin the dquot buffer buf_log_item like so:
[ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000
[ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work
[ 3319.498458] Call Trace:
[ 3319.500800] <TASK>
[ 3319.502809] __schedule+0x699/0xb70
[ 3319.512672] schedule+0x64/0xd0
[ 3319.515573] schedule_timeout+0x30/0xf0
[ 3319.528125] __down_common+0xc3/0x200
[ 3319.531488] __down+0x1d/0x30
[ 3319.534186] down+0x48/0x50
[ 3319.540501] xfs_buf_lock+0x3d/0xe0
[ 3319.543609] xfs_buf_item_unpin+0x85/0x1b0
[ 3319.547248] xlog_cil_committed+0x289/0x570
[ 3319.571411] xlog_cil_process_committed+0x6d/0x90
[ 3319.575590] xlog_state_shutdown_callbacks+0x52/0x110
[ 3319.580017] xlog_force_shutdown+0x169/0x1a0
[ 3319.583780] xlog_ioend_work+0x7c/0xb0
[ 3319.587049] process_scheduled_works+0x1d6/0x400
[ 3319.591127] worker_thread+0x202/0x2e0
[ 3319.594452] kthread+0x20c/0x240
The CIL push has seen the deadlock, so it has aborted the push and is running CIL checkpoint completion to abort all the items in the checkpoint. This calls ->iop_unpin(remove = true) to clean up the log items in the checkpoint. When a buffer log item is unpinned like this, it needs to lock the buffer to run io completion to correctly fail the buffer and run all the required completions to fail attached log items as well. In this case, the attempt to lock the buffer on unpin is hanging because the buffer is already locked. I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE handling changes I was testing, so I went looking for pin events on HOLD buffers and unpin events on locked buffers. That isolated this one buffer with these two events:
xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags HOLD|DIRTY|LOGGED liflags DIRTY
....
xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED
Firstly, bbcount = 0x2, which means it is not a single sector structure. That rules out every xfs_trans_bhold() case except one: dquot buffers.
Then hung task dumping gave this trace:
[ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002
[ 3197.323007] Call Trace:
[ 3197.325581] <TASK>
[ 3197.327727] __schedule+0x699/0xb70
[ 3197.334582] schedule+0x64/0xd0
[ 3197.337672] schedule_timeout+0x30/0xf0
[ 3197.350139] wait_for_completion+0xbd/0x180
[ 3197.354235] __flush_workqueue+0xef/0x4e0
[ 3197.362229] xlog_cil_force_seq+0xa0/0x300
[ 3197.374447] xfs_log_force+0x77/0x230
[ 3197.378015] xfs_qm_dqunpin_wait+0x49/0xf0
[ 3197.382010] xfs_qm_dqflush+0x55/0x460
[ 3197.385663] xfs_qm_dquot_isolate+0x29e/0x4d0
[ 3197.389977] __list_lru_walk_one+0x141/0x220
[ 3197.398867] list_lru_walk_one+0x10/0x20
[ 3197.402713] xfs_qm_shrink_scan+0x6a/0x100
[ 3197.406699] do_shrink_slab+0x18a/0x350
[ 3197.410512] shrink_slab+0xf7/0x430
[ 3197.413967] drop_slab+0x97/0xf0
[ 3197.417121] drop_caches_sysctl_handler+0x59/0xc0
[ 3197.421654] proc_sys_call_handler+0x18b/0x280
[ 3197.426050] proc_sys_write+0x13/0x20
[ 3197.429750] vfs_write+0x2b8/0x3e0
[ 3197.438532] ksys_write+0x7e/0xf0
[ 3197.441742] __x64_sys_write+0x1b/0x30
[ 3197.445363] x64_sys_call+0x2c72/0x2f60
[ 3197.449044] do_syscall_64+0x6c/0x140
[ 3197.456341] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Yup, another test run by check-parallel is running drop_caches concurrently and the dquot shrinker for the hung filesystem is running. That's trying to flush a dirty dquot from reclaim context, and it is waiting on a log force to complete. xfs_qm_dqflush is called with the dquot buffer held locked, and so we've called xfs_log_force() with that buffer locked. Now the log force is waiting for a workqueue flush to complete, and that workqueue flush is waiting on CIL checkpoint processing to finish. The CIL checkpoint processing is aborting all the log items it has, and that requires locking aborted buffers to cancel them. Now, normally this isn't a problem if we are issuing a log force to unpin an object, because the ->iop_unpin() method wakes pin waiters first. That results in the pin waiter finishing off whatever it was doing, dropping the lock, and then xfs_buf_item_unpin() can lock the buffer and fail it. However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not the dquot buffer unpin event, and so it never gets woken and so does not drop the buffer lock. Inodes do not have this problem, as they can only be written from one spot (->iop_push) whilst dquots can be written from multiple places (memory reclaim, ->iop_push, xfs_qm_dqpurge, and quotacheck). The reason that the dquot buffer has an attached buffer log item is that it has been recently allocated. Initialisation of the dquot buffer logs the buffer directly, thereby pinning it in memory. We then modify the dquot in a separate operation, and have memory reclaim racing with a shutdown and we trigger this deadlock. check-parallel reproduces this reliably on 1kB FSB filesystems with quota enabled because it does all of these things concurrently without having to explicitly write tests to exercise these corner case conditions. xfs_qm_dquot_logitem_push() doesn't have this deadlock because it checks if the dquot is pinned before locking the dquot buffer and skipping it if it is pinned. This means the xfs_qm_dqunpin_wait() log force in xfs_qm_dqflush() never triggers and we unlock the buffer safely, allowing a concurrent shutdown to fail the buffer appropriately.
xfs_qm_dqpurge() could have this problem as it is called from quotacheck and we might have allocated dquot buffers when recording the quota updates. This can be fixed by calling xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we hold the dquot locked, nothing will be able to add to the pin count between the unpin_wait and the dqflush callout, so this now makes xfs_qm_dqpurge() safe against this race. xfs_qm_dquot_isolate() can also be fixed the same way but, quite frankly, we shouldn't be doing IO in memory reclaim context. If the dquot is pinned or dirty, simply rotate it and let memory reclaim come back to it later, the same as we do for inodes. This then gets rid of the nasty issue in xfs_qm_flush_one() where quotacheck writeback races with memory reclaim flushing the dquots. We can lift xfs_qm_dqunpin_wait() up into this code, then get rid of the "can't get the dqflush lock" buffer write used to cycle the dqflush lock and enable it to be flushed again, instead checking if the dquot is pinned and returning -EAGAIN so that the dquot walk will revisit the dquot later. Finally, with xfs_qm_dqunpin_wait() lifted into all the callers, we can remove it from the xfs_qm_dqflush() code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
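[Editor's sketch] The reclaim-side change described above for xfs_qm_dquot_isolate(), as a fragment; list_lru isolate callbacks return enum lru_status, and the exact flag tests here are illustrative:

```c
/* Sketch: no IO from the shrinker - skip pinned/dirty dquots and let
 * memory reclaim revisit them later, as is done for inodes. */
if (atomic_read(&dqp->q_pincount) > 0 || XFS_DQ_IS_DIRTY(dqp))
	return LRU_ROTATE;
```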
2025-06-27  xfs: catch stale AGF/AGI metadata  (Dave Chinner)
There is a race condition that can trigger in dmflakey fstests that can result in asserts in xfs_ialloc_read_agi() and xfs_alloc_read_agf() firing. The asserts look like this:
XFS: Assertion failed: pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks), file: fs/xfs/libxfs/xfs_alloc.c, line: 3440
.....
Call Trace:
<TASK>
xfs_alloc_read_agf+0x2ad/0x3a0
xfs_alloc_fix_freelist+0x280/0x720
xfs_alloc_vextent_prepare_ag+0x42/0x120
xfs_alloc_vextent_iterate_ags+0x67/0x260
xfs_alloc_vextent_start_ag+0xe4/0x1c0
xfs_bmapi_allocate+0x6fe/0xc90
xfs_bmapi_convert_delalloc+0x338/0x560
xfs_map_blocks+0x354/0x580
iomap_writepages+0x52b/0xa70
xfs_vm_writepages+0xd7/0x100
do_writepages+0xe1/0x2c0
__writeback_single_inode+0x44/0x340
writeback_sb_inodes+0x2d0/0x570
__writeback_inodes_wb+0x9c/0xf0
wb_writeback+0x139/0x2d0
wb_workfn+0x23e/0x4c0
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
I've seen the AGI variant from scrub running on the filesystem after unmount failed due to systemd interference:
XFS: Assertion failed: pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || xfs_is_shutdown(pag->pag_mount), file: fs/xfs/libxfs/xfs_ialloc.c, line: 2804
.....
Call Trace:
<TASK>
xfs_ialloc_read_agi+0xee/0x150
xchk_perag_drain_and_lock+0x7d/0x240
xchk_ag_init+0x34/0x90
xchk_inode_xref+0x7b/0x220
xchk_inode+0x14d/0x180
xfs_scrub_metadata+0x2e2/0x510
xfs_ioc_scrub_metadata+0x62/0xb0
xfs_file_ioctl+0x446/0xbf0
__se_sys_ioctl+0x6f/0xc0
__x64_sys_ioctl+0x1d/0x30
x64_sys_call+0x1879/0x2ee0
do_syscall_64+0x68/0x130
? exc_page_fault+0x62/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Essentially, it is the same problem. When _flakey_drop_and_remount() loads the drop-writes table, it makes all writes silently fail. Writes are reported to the fs as completed successfully, but they are not issued to the backing store. The filesystem sees the successful write completion and marks the metadata buffer clean and removes it from the AIL. If this happens at the same time as memory pressure is occurring, the now-clean AGF and/or AGI buffers can be reclaimed from memory. Shortly afterwards, but before _flakey_drop_and_remount() runs unmount, background writeback is kicked and it tries to allocate blocks for the dirty pages in memory. This then tries to access the AGF buffer we just turfed out of memory. It's not found, so it gets read in from disk. This is all fine, except for the fact that the last writeback of the AGF did not actually reach disk. The AGF on disk is stale compared to the in-memory state held by the perag, and so they don't match and the assert fires. Then other operations on that inode hang because the task was killed whilst holding inode locks. e.g.:
Workqueue: xfs-conv/dm-12 xfs_end_io
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_preempt_disabled+0x15/0x30
rwsem_down_write_slowpath+0x31a/0x5f0
down_write+0x43/0x60
xfs_ilock+0x1a8/0x210
xfs_trans_alloc_inode+0x9c/0x240
xfs_iomap_write_unwritten+0xe3/0x300
xfs_end_ioend+0x90/0x130
xfs_end_io+0xce/0x100
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
and it's all down hill from there. Memory pressure is one way to trigger this, another is to run "echo 3 > /proc/sys/vm/drop_caches" randomly while tests are running.
Regardless of how it is triggered, this effectively takes down the system once umount hangs, because umount holds the sb->s_umount lock exclusive and now every sync(1) call gets stuck on it.

Fix this by replacing the asserts with a corruption detection check and a shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
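[Editor's note: the shape of that fix is roughly the following. This is a simplified sketch rather than the exact patch; XFS_IS_CORRUPT(), xfs_force_shutdown() and SHUTDOWN_CORRUPT_ONDISK are existing XFS facilities, but the wrapper function and its context here are illustrative.]

	/* Validate the in-memory AGF counters instead of ASSERTing on them. */
	static int
	xfs_agf_counters_check_sketch(
		struct xfs_mount	*mp,
		struct xfs_perag	*pag,
		struct xfs_agf		*agf)
	{
		/* A mismatch is on-disk corruption, not a programming bug. */
		if (XFS_IS_CORRUPT(mp,
				pag->pagf_freeblks != be32_to_cpu(agf->agf_freeblks))) {
			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
			return -EFSCORRUPTED;
		}
		return 0;
	}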
2025-06-27xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlockDave Chinner
Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK -> IFLUSHING, except for the last inode in the cluster that is triggering the free. In that case, the lock order is ILOCK -> cluster buffer -> IFLUSHING.

xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING, so this can safely run concurrently with xfs_ifree_cluster().

xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this cannot race with xfs_ifree_cluster() so being in a different order will not trigger a deadlock.

xfs_reclaim_inode() during a filesystem shutdown uses ILOCK -> IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and this deadlocks against xfs_ifree_cluster() like so:

sysrq: Show Blocked State
task:kworker/10:37 state:D stack:12560 pid:276182 tgid:276182 ppid:2 flags:0x00004000
Workqueue: xfs-inodegc/dm-3 xfs_inodegc_worker
Call Trace:
 <TASK>
 __schedule+0x650/0xb10
 schedule+0x6d/0xf0
 schedule_timeout+0x8b/0x180
 schedule_timeout_uninterruptible+0x1e/0x30
 xfs_ifree+0x326/0x730
 xfs_inactive_ifree+0xcb/0x230
 xfs_inactive+0x2c8/0x380
 xfs_inodegc_worker+0xaa/0x180
 process_scheduled_works+0x1d4/0x400
 worker_thread+0x234/0x2e0
 kthread+0x147/0x170
 ret_from_fork+0x3e/0x50
 ret_from_fork_asm+0x1a/0x30
 </TASK>

task:fsync-tester state:D stack:12160 pid:2255943 tgid:2255943 ppid:3988702 flags:0x00004006
Call Trace:
 <TASK>
 __schedule+0x650/0xb10
 schedule+0x6d/0xf0
 schedule_timeout+0x31/0x180
 __down_common+0xbe/0x1f0
 __down+0x1d/0x30
 down+0x48/0x50
 xfs_buf_lock+0x3d/0xe0
 xfs_iflush_shutdown_abort+0x51/0x1e0
 xfs_icwalk_ag+0x386/0x690
 xfs_reclaim_inodes_nr+0x114/0x160
 xfs_fs_free_cached_objects+0x19/0x20
 super_cache_scan+0x17b/0x1a0
 do_shrink_slab+0x180/0x350
 shrink_slab+0xf8/0x430
 drop_slab+0x97/0xf0
 drop_caches_sysctl_handler+0x59/0xc0
 proc_sys_call_handler+0x189/0x280
 proc_sys_write+0x13/0x20
 vfs_write+0x33d/0x3f0
 ksys_write+0x7c/0xf0
 __x64_sys_write+0x1b/0x30
 x64_sys_call+0x271d/0x2ee0
 do_syscall_64+0x68/0x130
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE and XFS_IFLUSHING are serialised through to journal IO completion by the cluster buffer lock being held. There are quite a few asserts in the code that check that XFS_ISTALE does not occur out of sync with buffer locking (e.g. in xfs_iflush_cluster). There's also a dependency on the inode log item being removed from the buffer before XFS_IFLUSHING is cleared, also with asserts that trigger on this.

Further, we don't have a requirement for the inode to be locked when completing or aborting inode flushing because all the inode state updates are serialised by holding the cluster buffer lock across the IO to completion.

We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and skip the inode, because there is no guarantee that the inode will be reclaimed. Hence it *must* be marked XFS_ISTALE regardless of whether reclaim is preparing to free that inode. Similarly, we can't check for IFLUSHING before locking the inode because that would result in dirty inodes not being marked with ISTALE in the event of racing with XFS_IRECLAIM.

Hence we have to address this issue from the xfs_reclaim_inode() side. It is clear that we cannot hold the inode locked here when calling xfs_iflush_shutdown_abort() because it is the inode->buffer lock order that causes the deadlock against xfs_ifree_cluster(). Hence we need to drop the ILOCK before aborting the inode in the shutdown case.
Once we've aborted the inode, we can grab the ILOCK again and then immediately reclaim it as it is now guaranteed to be clean.

Note that dropping the ILOCK in xfs_reclaim_inode() means that it can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in this state. This is safe because we have left the XFS_IFLUSHING flag on the inode and so xfs_ifree_mark_inode_stale() will simply set XFS_ISTALE and move to the next inode. An ASSERT check in this path needs to be tweaked to take into account this new shutdown interaction.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
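[Editor's note: the reordering in xfs_reclaim_inode() described above looks roughly like this. A simplified sketch of the shutdown branch only, assuming the usual XFS locking helpers; error handling and the surrounding reclaim loop (including the reclaim: label the goto targets) are elided.]

	/* Shutdown case: abort the flush without holding the ILOCK. */
	if (xfs_is_shutdown(ip->i_mount)) {
		xfs_iunlock(ip, XFS_ILOCK_EXCL);  /* avoid inode -> buffer lock order */
		xfs_iflush_shutdown_abort(ip);    /* takes the cluster buffer lock */
		xfs_ilock(ip, XFS_ILOCK_EXCL);    /* inode is guaranteed clean now */
		goto reclaim;
	}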
2025-06-26Merge tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet:

 - Lots of small check/repair fixes, primarily in subvol loop and directory structure loop (when involving snapshots).

 - Fix a few 6.16 regressions: a rare UAF in the foreground allocator path when taking a transaction restart from the transaction bump allocator, and some small fallout from the change to log the error being corrected in the journal when repairing errors, also some fallout from the btree node read error logging improvements. (Alan, Bharadwaj)

 - New option: journal_rewind

   This lets the entire filesystem be reset to an earlier point in time. Note that this is only a disaster recovery tool, and right now there are major caveats to using it (discards should be disabled, in particular), but it successfully restored the filesystem of one of the users who was bitten by the subvolume deletion bug and didn't have backups. I'll likely be making some changes to the discard path in the future to make this a reliable recovery tool.

 - Some new btree iterator tracepoints, for tracking down some livelock-ish behaviour we've been seeing in the main data write path.

* tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs: (51 commits)
  bcachefs: Plumb correct ip to trans_relock_fail tracepoint
  bcachefs: Ensure we rewind to run recovery passes
  bcachefs: Ensure btree node scan runs before checking for scanned nodes
  bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
  bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
  bcachefs: Use wait_on_allocator() when allocating journal
  bcachefs: Check for bad write buffer key when moving from journal
  bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
  bcachefs: Fix range in bch2_lookup_indirect_extent() error path
  bcachefs: fix spurious error_throw
  bcachefs: Add missing bch2_err_class() to fileattr_set()
  bcachefs: Add missing key type checks to check_snapshot_exists()
  bcachefs: Don't log fsck err in the journal if doing repair elsewhere
  bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
  bcachefs: Fix missing newlines before ero
  bcachefs: fix spurious error in read_btree_roots()
  bcachefs: fsck: Fix oops in key_visible_in_snapshot()
  bcachefs: fsck: fix unhandled restart in topology repair
  bcachefs: fsck: Fix check_directory_structure when no check_dirents
  bcachefs: Fix restart handling in btree_node_scrub_work()
  ...
2025-06-26Merge branch 'vfs-6.17.bpf' of ↵Alexei Starovoitov
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Merge branch 'vfs-6.17.bpf' from vfs tree into bpf-next/master and resolve conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>