path: root/fs
Age | Commit message | Author
2025-03-06orangefs: Pass mapping to orangefs_writepages_work()Matthew Wilcox (Oracle)
Remove two accesses to page->mapping by passing the mapping from orangefs_writepages() to orangefs_writepages_callback() and then orangefs_writepages_work(). That makes it obvious that all folios come from the same mapping, so we can hoist the call to mapping_set_error() outside the loop. While I'm here, switch from write_cache_pages() to writeback_iter() which removes an indirect function call. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250305204734.1475264-7-willy@infradead.org Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06orangefs: Convert orangefs_writepage_locked() to take a folioMatthew Wilcox (Oracle)
Both callers have a folio, pass it in and use it inside orangefs_writepage_locked(). Removes a few hidden calls to compound_head() and accesses to page->mapping. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250305204734.1475264-6-willy@infradead.org Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06orangefs: Remove orangefs_writepage()Matthew Wilcox (Oracle)
If we add a migrate_folio operation, we can remove orangefs_writepage (as there is already a writepages operation). filemap_migrate_folio() will do fine as struct orangefs_write_range does not need to be adjusted when the folio is migrated. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250305204734.1475264-5-willy@infradead.org Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06orangefs: make open_for_read and open_for_write booleanMatthew Wilcox (Oracle)
sparse currently warns: fs/orangefs/file.c:119:32: warning: incorrect type in assignment (different base types) fs/orangefs/file.c:119:32: expected int open_for_write fs/orangefs/file.c:119:32: got restricted fmode_t Turning open_for_write and open_for_read into booleans (which is how they're used) removes this warning. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250305204734.1475264-4-willy@infradead.org Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06orangefs: Move s_kmod_keyword_mask_map to orangefs-debugfs.cMatthew Wilcox (Oracle)
Attempting to build orangefs with W=1 currently reports errors like: In file included from ../fs/orangefs/protocol.h:287, from ../fs/orangefs/waitqueue.c:16: ../fs/orangefs/orangefs-debug.h:86:18: error: ‘num_kmod_keyword_mask_map’ defined but not used [-Werror=unused-const-variable=] Move num_kmod_keyword_mask_map, s_kmod_keyword_mask_map and struct __keyword_mask_s to orangefs-debugfs.c which is the only file they're used in. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250305204734.1475264-3-willy@infradead.org Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06orangefs: Do not truncate file sizeMatthew Wilcox (Oracle)
'len' is used to store the result of i_size_read(), so making 'len' a size_t results in truncation to 4GiB on 32-bit systems. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250305204734.1475264-2-willy@infradead.org Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
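The truncation described in this commit can be reproduced in userspace. The sketch below is illustrative only and is not the orangefs code: `size32_t` is a hypothetical stand-in for `size_t` on a 32-bit system, and both function names are made up for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for size_t as it would be on a 32-bit system. */
typedef uint32_t size32_t;

/* Store a 64-bit file size in a 32-bit variable: the high 32 bits are lost. */
static uint32_t store_in_32bit_size(int64_t i_size)
{
	size32_t len = (size32_t)i_size;

	return len;
}

/* Store the same size in a 64-bit signed variable (as loff_t is): no loss. */
static int64_t store_in_64bit_offset(int64_t i_size)
{
	int64_t len = i_size;

	return len;
}
```

With a 5 GiB file size (5368709120 bytes), the 32-bit variable ends up holding 1 GiB (1073741824), while the 64-bit variable keeps the full value.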
2025-03-05NFS: fix nfs_release_folio() to not deadlock via kcompactd writebackMike Snitzer
Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for it so nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd. Otherwise NFS can deadlock waiting for kcompactd-induced writeback which recurses back to NFS (which triggers writeback to NFSD via NFS loopback mount on the same host, NFSD blocks waiting for XFS's call to __filemap_get_folio):

6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds.
{---
[58] "kcompactd0"
[<0>] folio_wait_bit+0xe8/0x200
[<0>] folio_wait_writeback+0x2b/0x80
[<0>] nfs_wb_folio+0x80/0x1b0 [nfs]
[<0>] nfs_release_folio+0x68/0x130 [nfs]
[<0>] split_huge_page_to_list_to_order+0x362/0x840
[<0>] migrate_pages_batch+0x43d/0xb90
[<0>] migrate_pages_sync+0x9a/0x240
[<0>] migrate_pages+0x93c/0x9f0
[<0>] compact_zone+0x8e2/0x1030
[<0>] compact_node+0xdb/0x120
[<0>] kcompactd+0x121/0x2e0
[<0>] kthread+0xcf/0x100
[<0>] ret_from_fork+0x31/0x40
[<0>] ret_from_fork_asm+0x1a/0x30
---}

[akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20250225022002.26141-1-snitzer@kernel.org Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05ext4: protect ext4_release_dquot against freezingOjaswin Mujoo
Protect ext4_release_dquot against freezing so that we don't try to start a transaction when the FS is frozen, leading to warnings. Further, avoid taking the freeze protection if a transaction is already running so that we don't end up in a deadlock as described in 46e294efc355 ("ext4: fix deadlock with fs freezing and EA inodes"). Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241121123855.645335-3-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-05fs: use fput_close() in path_openat()Mateusz Guzik
This bumps failing open rate by 1.7% on Sapphire Rapids by avoiding one atomic. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-5-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fs: use fput_close() in filp_close()Mateusz Guzik
When tracing a kernel build over refcounts seen this is a wash:

@[kprobe:filp_close]:
[0] 32195 |@@@@@@@@@@ |
[1] 164567 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

I verified that the vast majority of the skew comes from do_close_on_exec(), which could be changed to use a different variant instead. Even without changing that, the 19.5% of calls which got here can still save the extra atomic. Calls here are borderline non-existent compared to fput (over 3.2 mln!), so they should not negatively affect scalability. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-4-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fs: use fput_close_sync() in close()Mateusz Guzik
This bumps open+close rate by 1% on Sapphire Rapids by eliding one atomic. It would be higher if it was not for several other slowdowns of the same nature. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-3-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05file: add fput and file_ref_put routines optimized for use when closing a fdMateusz Guzik
Vast majority of the time closing a file descriptor also operates on the last reference, where a regular fput usage will result in 2 atomics. This can be changed to only suffer 1. See commentary above file_ref_put_close() for more information. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fs: predict no error in close()Mateusz Guzik
Vast majority of the time the system call returns 0. Letting the compiler know shortens the routine (119 -> 116) and the fast path. Disasm starting at the call to __fput_sync():

before:
<+55>: call 0xffffffff816b0da0 <__fput_sync>
<+60>: lea 0x201(%rbx),%eax
<+66>: cmp $0x1,%eax
<+69>: jbe 0xffffffff816ab707 <__x64_sys_close+103>
<+71>: mov %ebx,%edx
<+73>: movslq %ebx,%rax
<+76>: and $0xfffffffd,%edx
<+79>: cmp $0xfffffdfc,%edx
<+85>: mov $0xfffffffffffffffc,%rdx
<+92>: cmove %rdx,%rax
<+96>: pop %rbx
<+97>: pop %rbp
<+98>: jmp 0xffffffff82242fa0 <__x86_return_thunk>
<+103>: mov $0xfffffffffffffffc,%rax
<+110>: jmp 0xffffffff816ab700 <__x64_sys_close+96>
<+112>: mov $0xfffffffffffffff7,%rax
<+119>: jmp 0xffffffff816ab700 <__x64_sys_close+96>

after:
<+56>: call 0xffffffff816b0da0 <__fput_sync>
<+61>: xor %eax,%eax
<+63>: test %ebp,%ebp
<+65>: jne 0xffffffff816ab6ea <__x64_sys_close+74>
<+67>: pop %rbx
<+68>: pop %rbp
<+69>: jmp 0xffffffff82242fa0 <__x86_return_thunk>
# the jmp out
<+74>: lea 0x201(%rbp),%edx
<+80>: mov $0xfffffffffffffffc,%rax
<+87>: cmp $0x1,%edx
<+90>: jbe 0xffffffff816ab6e3 <__x64_sys_close+67>
<+92>: mov %ebp,%edx
<+94>: and $0xfffffffd,%edx
<+97>: cmp $0xfffffdfc,%edx
<+103>: cmovne %rbp,%rax
<+107>: jmp 0xffffffff816ab6e3 <__x64_sys_close+67>
<+109>: mov $0xfffffffffffffff7,%rax
<+116>: jmp 0xffffffff816ab6e3 <__x64_sys_close+67>

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250301104356.246031-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
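A userspace sketch of the kind of branch hint this commit describes follows. It is not the kernel's close() implementation: `close_retval()` is a hypothetical helper, `likely()` is redefined locally on top of the GCC/Clang builtin `__builtin_expect`, and the `K_ERESTART*` constants mirror the kernel-internal restart codes, which are not exported to userspace headers.

```c
#include <errno.h>

/* Local re-definition for illustration; requires GCC or Clang. */
#define likely(x) __builtin_expect(!!(x), 1)

/* Kernel-internal restart codes, mirrored here because userspace
 * <errno.h> does not define them. */
enum {
	K_ERESTARTSYS = 512,
	K_ERESTARTNOINTR = 513,
	K_ERESTARTNOHAND = 514,
	K_ERESTART_RESTARTBLOCK = 516,
};

/* Hypothetical helper: turn the flush result into the syscall return value.
 * The hint tells the compiler that zero is the common case, so the restart
 * code translation can be laid out off the fast path. */
static int close_retval(int flush_result)
{
	if (likely(flush_result == 0))
		return 0;

	/* close() cannot be restarted, so restart codes become -EINTR. */
	if (flush_result == -K_ERESTARTSYS ||
	    flush_result == -K_ERESTARTNOINTR ||
	    flush_result == -K_ERESTARTNOHAND ||
	    flush_result == -K_ERESTART_RESTARTBLOCK)
		return -EINTR;

	return flush_result;
}
```

The hint changes only code layout, not results: every input maps to the same return value with or without it.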
2025-03-05f2fs: set highest IO priority for checkpoint threadJaegeuk Kim
The checkpoint is the top priority thread which can stop all the filesystem operations. Let's make it RT priority. Reviewed-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-05exfat: add a check for invalid data sizeYuezhang Mo
Add a check for invalid data size to prevent a corrupted filesystem from being further corrupted. Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-03-05exfat: short-circuit zero-byte writes in exfat_file_write_iterEric Sandeen
When generic_write_checks() returns zero, it means that iov_iter_count() is zero, and there is no work to do. Simply return success like all other filesystems do, rather than proceeding down the write path, which today yields an -EFAULT in generic_perform_write() via the (fault_in_iov_iter_readable(i, bytes) == bytes) check when bytes == 0. Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength") Reported-by: Noah <kernel-org-10@maxgrass.eu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-03-05exfat: fix soft lockup in exfat_clear_bitmapNamjae Jeon
The bitmap clear loop will take a long time in __exfat_free_cluster() if the data size of a file/dir entry is invalid. If the cluster bit in the bitmap is already clear, stop clearing the bitmap and break out of the loop. Fixes: 31023864e67a ("exfat: add fat entry operations") Reported-by: Kun Hu <huk23@m.fudan.edu.cn>, Jiaji Qin <jjtan24@m.fudan.edu.cn> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-03-05exfat: fix just enough dentries but allocate a new cluster to dirYuezhang Mo
This commit fixes the condition for allocating cluster to parent directory to avoid allocating new cluster to parent directory when there are just enough empty directory entries at the end of the parent directory. Fixes: af02c72d0b62 ("exfat: convert exfat_find_empty_entry() to use dentry cache") Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-03-05pidfs: allow to retrieve exit informationChristian Brauner
Some tools like systemd's journal need to retrieve the exit and cgroup information after a process has already been reaped. This can happen, e.g., when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-6-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05pidfs: record exit code and cgroupid at exitChristian Brauner
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05pidfs: use private inode slab cacheChristian Brauner
Introduce a private inode slab cache for pidfs. In follow-up patches pidfs will gain the ability to provide exit information to userspace after the task has been reaped. This means storing exit information even after the task has already been released and struct pid's task linkage is gone. Store that information alongside the inode. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-4-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05pidfs: move setting flags into pidfs_alloc_file()Christian Brauner
Instead of adding it into __pidfd_prepare(), place it where the actual file allocation happens and update the outdated comment. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05pidfs: switch to copy_struct_to_user()Christian Brauner
We have a helper that deals with all the required logic. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-1-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folioMatthew Wilcox (Oracle)
ext4 and ceph already have a folio to pass; f2fs needs to be properly converted but this will do for now. This removes a reference to page->index and page->mapping as well as removing a call to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250304170224.523141-1-willy@infradead.org Acked-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05VFS: Change vfs_mkdir() to return the dentry.NeilBrown
vfs_mkdir() does not guarantee to leave the child dentry hashed or make it positive on success, and in many such cases the filesystem had to use a different dentry which it can now return. This patch changes vfs_mkdir() to return the dentry provided by the filesystems which is hashed and positive when provided. This reduces the number of cases where the resulting dentry is not positive to a handful which don't deserve extra efforts.

The only callers of vfs_mkdir() which are interested in the resulting inode are in-kernel filesystem clients: cachefiles, nfsd, smb/server. The only filesystems that don't reliably provide the inode are:
- kernfs, tracefs which these clients are unlikely to be interested in
- cifs in some configurations would need to do a lookup to find the created inode, but doesn't. cifs cannot be exported via NFS, is unlikely to be used by cachefiles, and smb/server only has a soft requirement for the inode, so this is unlikely to be a problem in practice.
- hostfs, nfs, cifs may need to do a lookup (rarely for NFS) and it is possible for a race to make that lookup fail. Actual failure is unlikely and, provided callers handle negative dentries gracefully, they will fail safe.

So this patch removes the lookup code in nfsd and smb/server and adjusts them to fail safe if a negative dentry is provided:
- cachefiles already fails safe by restarting the task from the top; it still does with this change, though it no longer calls cachefiles_put_directory() as that will crash if the dentry is negative.
- nfsd reports "Server-fault" which is what it used to do if the lookup failed. This will never happen on any filesystems that it can actually export, so this is of no consequence. I removed the fh_update() call as that is not needed and out-of-place. A subsequent nfsd_create_setattr() call will call fh_update() when needed.
- smb/server only wants the inode to call ksmbd_smb_inherit_owner() which updates ->i_uid (without calling notify_change() or similar) which can be safely skipped on cifs (I hope).

If a different dentry is returned, the first one is put. If necessary the fact that it is new can be determined by comparing pointers. A new dentry will certainly have a new pointer (as the old is put after the new is obtained). Similarly if an error is returned (via ERR_PTR()) the original dentry is put. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-7-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05nfs: change mkdir inode_operation to return alternate dentry if needed.NeilBrown
mkdir now allows a different dentry to be returned which is sometimes relevant for nfs. This patch changes the nfs_rpc_ops mkdir op to return a dentry, and passes that back to the caller. The mkdir nfs_rpc_op will return NULL if the original dentry should be used. This matches the mkdir inode_operation. nfs4_do_create() is duplicated to nfs4_do_mkdir() which is changed to handle the specifics of directories. Consequently the current special handling for directories is removed from nfs4_do_create() Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-6-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fuse: return correct dentry for ->mkdirNeilBrown
fuse already uses d_splice_alias() to ensure an appropriate dentry is found for a newly created dentry. Now that ->mkdir can return that dentry we do so. This requires changing create_new_entry() to return a dentry and handling that change in all callers. Note that when create_new_entry() is asked to create anything other than a directory we can be sure it will NOT return an alternate dentry as d_splice_alias() only returns an alternate dentry for directories. So we don't need to check for that case when passing on the result. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/174112490070.33508.15852253149143067890@noble.neil.brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05ceph: Fix error handling in fill_readdir_cache()Matthew Wilcox (Oracle)
__filemap_get_folio() returns an ERR_PTR, not NULL. There are extensive assumptions that ctl->folio is NULL, not an error pointer, so it seems better to fix this one place rather than change all the places which check ctl->folio. Fixes: baff9740bc8f ("ceph: Convert ceph_readdir_cache_control to store a folio") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250304154818.250757-1-willy@infradead.org Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
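The pattern of normalizing an error pointer to NULL at a single call site can be sketched in userspace as follows. This is an assumption-laden illustration, not the ceph code: `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` are re-implemented locally in the style of the kernel helpers, and `struct cache_ctl`, `get_folio_stub()` and `fill_cache()` are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Local re-implementations in the style of the kernel's err.h helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical control structure whose users only ever test for NULL. */
struct cache_ctl {
	void *folio;
};

/* Hypothetical lookup that reports failure as an error pointer. */
static void *get_folio_stub(int fail)
{
	static int dummy_folio;

	return fail ? ERR_PTR(-ENOMEM) : (void *)&dummy_folio;
}

/* Convert the error pointer to NULL here, once, so that the existing
 * "ctl->folio == NULL" checks elsewhere stay correct. */
static int fill_cache(struct cache_ctl *ctl, int fail)
{
	void *folio = get_folio_stub(fail);

	if (IS_ERR(folio)) {
		ctl->folio = NULL;
		return (int)PTR_ERR(folio);
	}
	ctl->folio = folio;
	return 0;
}
```

The design choice matches the commit's reasoning: one conversion at the source is less invasive than teaching every consumer of the field about error pointers.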
2025-03-04fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutexLinus Torvalds
pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head" and "pipe->tail" outside of "pipe->mutex" critical section. When the head and the tail are read individually in that order, there is a window for interruption between the two reads in which both the head and the tail can be updated by concurrent readers and writers. One of the problematic scenarios observed with hackbench running multiple groups on a large server on a particular pipe inode is as follows:

pipe->head = 36
pipe->tail = 36

hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full*
hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740*
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768*

hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)

hackbench-118740 [057] ..... 1029.550558: pipe_read: *reader wakes up: pipe is not empty*
hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37]
hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768*
hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps*

hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in*
hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766*

hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38]
hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768*

... CPU 206 switches back to writer
hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep*

[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]

The unsigned arithmetic in pipe_occupancy() wraps around when "pipe->tail > pipe->head", leading to pipe_full() returning true despite the pipe being empty. The case of genuine wraparound of "pipe->head" is handled, since the pipe buffer has data allowing readers to make progress until "pipe->tail" wraps too, after which the reader will wake up a sleeping writer. However, mistaking the pipe to be full when it is in fact empty can lead to readers and writers waiting on each other indefinitely.

This issue became more problematic and surfaced as a hang in hackbench after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full") significantly reduced the number of spurious wakeups of writers that had previously helped mask the issue.

To avoid missing any updates between the reads of "pipe->head" and "pipe->tail", unionize the two with a single unsigned long "pipe->head_tail" member that can be loaded atomically. Using "pipe->head_tail" to read the head and the tail ensures the lockless checks do not miss any updates to the head or the tail, and since those two are only updated under "pipe->mutex", it ensures that the head is always ahead of, or equal to, the tail, resulting in correct calculations.

[ prateek: commit log, testing on x86 platforms. ]

Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/ Reported-by: Alexey Gladkov <legion@kernel.org> Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/ Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
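The wraparound and the single-load idea from this commit can be sketched in userspace. The code below is a simplified illustration, not the kernel's struct pipe_inode_info: the union layout, the function names, and the use of a volatile 64-bit read as a stand-in for an atomic load on a 64-bit machine are all assumptions made for the example.

```c
#include <stdint.h>

/* Occupancy computed with unsigned 32-bit arithmetic, so that a genuine
 * wraparound of the head counter still yields the right difference. */
static uint32_t occupancy(uint32_t head, uint32_t tail)
{
	return head - tail;
}

static int is_full(uint32_t head, uint32_t tail, uint32_t limit)
{
	return occupancy(head, tail) >= limit;
}

/* Illustrative layout: head and tail share one 64-bit word, so a lockless
 * reader can take a snapshot of both with one load (C11 anonymous struct). */
union head_tail {
	uint64_t word;
	struct {
		uint32_t head;
		uint32_t tail;
	};
};

/* Read both counters in one 64-bit access, then do the check on the
 * snapshot, so head and tail always belong to the same point in time. */
static int is_full_snapshot(const volatile union head_tail *p, uint32_t limit)
{
	union head_tail snap;

	snap.word = p->word;
	return is_full(snap.head, snap.tail, limit);
}
```

Feeding the values from the trace into `is_full()`, a stale head of 37 combined with a fresh tail of 38 gives an occupancy of 0xFFFFFFFF and the pipe looks full, whereas a consistent snapshot of head 38 and tail 38 reports an empty pipe.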
2025-03-04f2fs: Add f2fs_find_data_folio()Matthew Wilcox (Oracle)
Convert f2fs_find_data_page() to f2fs_find_data_folio() and add a compatibility wrapper. Saves six hidden calls to compound_head(). This was the last caller of f2fs_get_read_data_page(), so remove the compatibility wrapper. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert gc_data_segment() to use a folioMatthew Wilcox (Oracle)
Use f2fs_get_read_data_folio() instead of f2fs_get_read_data_page(). Saves a hidden call to compound_head() in f2fs_put_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert truncate_partial_data_page() to use a folioMatthew Wilcox (Oracle)
Retrieve a folio from the page cache and use it throughout. Saves five hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert move_data_page() to use a folioMatthew Wilcox (Oracle)
Fetch a folio from the page cache and use it throughout, saving eight hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Add f2fs_get_lock_data_folio()Matthew Wilcox (Oracle)
Convert f2fs_get_lock_data_page() to f2fs_get_lock_data_folio() and add a compatibility wrapper. Removes three hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Add f2fs_get_read_data_folio()Matthew Wilcox (Oracle)
Convert f2fs_get_read_data_page() into f2fs_get_read_data_folio() and add a compatibility wrapper. Saves seven hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Hoist the page_folio() call to the start of f2fs_merge_page_bio()Matthew Wilcox (Oracle)
Remove one call to compound_head() and a reference to page->mapping by calling page_folio() early on. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Use a folio throughout __get_meta_page()Matthew Wilcox (Oracle)
Use f2fs_grab_cache_folio() to get a folio and use it throughout, removing seven calls to compound_head() and a reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Use a folio throughout f2fs_truncate_inode_blocks()Matthew Wilcox (Oracle)
Use f2fs_get_node_folio() to get a folio and use it throughout. Remove a few calls to compound_head() and a reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Add f2fs_get_node_folio()Matthew Wilcox (Oracle)
Change __get_node_page() to return a folio and convert back to a page in f2fs_get_node_page() and f2fs_get_node_page_ra(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert f2fs_in_warm_node_list() to take a folioMatthew Wilcox (Oracle)
All its callers now have access to a folio, so pass it in. Removes an access to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Mark some functions as taking a const page pointerMatthew Wilcox (Oracle)
The compiler can make some optimisations if we tell it that a function call doesn't change this memory. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert f2fs_write_end_io() to use a folio_iterMatthew Wilcox (Oracle)
Iterate over each folio in the bio instead of each page. Follow the pattern in ext4 for handling bounce folios. Removes a few calls to compound_head() and references to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Use a folio in do_write_page()Matthew Wilcox (Oracle)
Convert fio->page to a folio then use it where folio APIs exist. Removes a reference to page->mapping and a hidden call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Use a folio in __get_node_page()Matthew Wilcox (Oracle)
Retrieve a folio from the page cache and use it throughout. Saves six hidden calls to compound_head() and removes a reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Add f2fs_grab_cache_folio()Matthew Wilcox (Oracle)
Convert f2fs_grab_cache_page() into f2fs_grab_cache_folio() and add a wrapper. Removes several calls to deprecated functions. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Return a folio from last_fsync_dnode()Matthew Wilcox (Oracle)
Convert last_page to last_folio in f2fs_fsync_node_pages() and use folio APIs where they exist. Saves a few hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert last_fsync_dnode() to use a folioMatthew Wilcox (Oracle)
Use the folio APIs where they exist. Saves several hidden calls to compound_head(). Also removes a reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert f2fs_fsync_node_pages() to use a folioMatthew Wilcox (Oracle)
Use the folio APIs where they exist. Saves several hidden calls to compound_head(). Also removes a reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Pass a folio to flush_dirty_inode()Matthew Wilcox (Oracle)
Its one caller now has a folio; pass it in and do page conversions where necessary inside flush_dirty_inode(). Saves two hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04f2fs: Convert f2fs_sync_node_pages() to use a folioMatthew Wilcox (Oracle)
Use the folio APIs where they exist. Saves several hidden calls to compound_head(). Also removes a reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>