path: root/fs
Age | Commit message | Author
2025-01-06 | nfsd: add support for freeing unused session-DRC slots | NeilBrown
Reducing the number of slots in the session slot table requires confirmation from the client. This patch adds reduce_session_slots(), which starts the process of getting confirmation, but nothing calls it yet. That will come in a later patch. Before we can free a slot we need to confirm that the client won't try to use it again. This involves returning a lower cr_maxrequests in a SEQUENCE reply and then seeing a ca_maxrequests on the same slot which is not larger than the limit we are trying to impose. So for each slot we need to remember that we have sent a reduced cr_maxrequests. To achieve this we introduce a concept of request "generations". Each time we decide to reduce cr_maxrequests we increment the generation number, and record this when we return the lower cr_maxrequests to the client. When a slot with the current generation reports a low ca_maxrequests, we commit to that level and free extra slots. We use a 16-bit generation number (64 bits seems wasteful) and if it cycles we iterate all slots and reset the generation number to avoid false matches. When we free a slot we store the seqid in the slot pointer so that it can be restored when we reactivate the slot. The RFC can be read as suggesting that the slot number could restart from one after a slot is retired and reactivated, but it also suggests that retiring slots is not required. So when we reactivate a slot we accept either the next seqid in sequence, or 1. When decoding sa_highest_slotid into maxslots we need to add 1 - this matches how it is encoded for the reply. se_dead is moved in struct nfsd4_session to remove a hole. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
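A minimal userspace C sketch of the generation-matching idea described above; the struct and field names are hypothetical and do not mirror the actual nfsd data structures.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-slot and per-session state, just to illustrate the idea. */
struct slot_state {
	uint16_t gen;		/* generation at which a reduced limit was advertised */
};

struct session_state {
	uint16_t gen;		/* bumped each time we decide to shrink the table */
	uint32_t target;	/* cr_maxrequests value we are trying to shrink to */
};

/*
 * A slot reply confirms the reduction only if the client saw the reduced
 * cr_maxrequests (same generation) and answered with a ca_maxrequests that
 * is no larger than the target.
 */
static bool reduction_confirmed(const struct session_state *ses,
				const struct slot_state *slot,
				uint32_t ca_maxrequests)
{
	return slot->gen == ses->gen && ca_maxrequests <= ses->target;
}
```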
2025-01-06 | nfsd: allocate new session-based DRC slots on demand. | NeilBrown
If a client ever uses the highest available slot for a given session, attempt to allocate more slots so there is room for the client to use them if wanted. GFP_NOWAIT is used so if there is not plenty of free memory, failure is expected - which is what we want. It also allows the allocation while holding a spinlock. Each time we increase the number of slots by 20% (rounded up). This allows fairly quick growth while avoiding excessive over-shoot. We would expect to stabilise with around 10% more slots available than the client actually uses. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
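For illustration, one plausible way to express the growth step in C; the exact rounding used by the patch is an assumption here.

```c
/*
 * Grow the slot count by 20%, rounded up (assumed rounding):
 * at least one extra slot is added for any non-zero count.
 */
static unsigned int grow_slot_count(unsigned int slots)
{
	return slots + (slots + 4) / 5;
}
```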
2025-01-06 | nfsd: add session slot count to /proc/fs/nfsd/clients/*/info | NeilBrown
Each client now reports the number of slots allocated in each session. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 | nfsd: remove artificial limits on the session-based DRC | NeilBrown
Rather than guessing how much space it might be safe to use for the DRC, simply try allocating slots and be prepared to accept failure. The first slot for each session is allocated with GFP_KERNEL which is unlikely to fail. Subsequent slots are allocated with the addition of __GFP_NORETRY which is expected to fail if there isn't much free memory. This is probably too aggressive but clears the way for adding a shrinker interface to free extra slots when memory is tight. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
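A kernel-style sketch of the allocation policy described above (illustrative only, not the actual nfsd code; the helper name is hypothetical): the first slot of a session is worth retrying for, later slots should fail fast rather than apply memory pressure.

```c
static struct nfsd4_slot *nfsd4_alloc_slot_sketch(size_t size, bool first)
{
	/* First slot: try hard. Later slots: opportunistic, fail fast. */
	gfp_t gfp = first ? GFP_KERNEL : GFP_KERNEL | __GFP_NORETRY;

	return kzalloc(size, gfp);
}
```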
2025-01-06 | nfsd: use an xarray to store v4.1 session slots | NeilBrown
Using an xarray to store session slots will make it easier to change the number of active slots based on demand, and removes an unnecessary limit. To achieve good throughput with a high-latency server it can be helpful to have hundreds of concurrent writes, which means hundreds of slots. So increase the limit to 2048 (twice what the Linux client will currently use). This limit is only a sanity check, not a hard limit. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
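A kernel-style sketch of on-demand slot lookup in an xarray (illustrative only; the helper name is hypothetical and error handling is reduced to a minimum):

```c
static struct nfsd4_slot *slot_lookup_or_create(struct xarray *slots,
						unsigned long idx)
{
	struct nfsd4_slot *slot = xa_load(slots, idx);

	if (slot)
		return slot;

	slot = kzalloc(sizeof(*slot), GFP_NOWAIT);
	if (!slot)
		return NULL;

	/* xa_store() returns an xa_err() pointer on allocation failure. */
	if (xa_is_err(xa_store(slots, idx, slot, GFP_NOWAIT))) {
		kfree(slot);
		return NULL;
	}
	return slot;
}
```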
2025-01-06 | sunrpc: remove all connection limit configuration | NeilBrown
Now that the connection limit only applies to unconfirmed connections, there is no need to configure it. So remove all the configuration and fix the number of unconfirmed connections at 64 - which is now given a name: XPT_MAX_TMP_CONN. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 | nfsd: don't use sv_nrthreads in connection limiting calculations. | NeilBrown
The heuristic for limiting the number of incoming connections to nfsd currently uses sv_nrthreads - allowing more connections if more threads were configured. A future patch will allow the number of threads to grow dynamically so that there will be no need to configure sv_nrthreads. So we need a different solution for limiting connections. It isn't clear what problem is solved by limiting connections (as mentioned in a code comment), but the most likely problem is a connection storm - many connections that are not doing productive work. These will be closed after about 6 minutes already, but it might help to slow down a storm. This patch adds a per-connection flag XPT_PEER_VALID which indicates that the peer has presented a filehandle for which it has some sort of access, i.e. the peer is known to be trusted in some way. We now only count connections which have NOT been determined to be valid. There should be relatively few of these at any given time. If the number of non-validated peers exceeds a limit - currently 64 - we close the oldest non-validated peer to avoid having too many of these useless connections. Note that this patch significantly changes the meaning of the various configuration parameters for "max connections". The next patch will remove all of these. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
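A small userspace C sketch of the counting change described above (the connection record and list handling are invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative connection record; peer_valid models the new XPT_PEER_VALID flag. */
struct conn {
	struct conn *next;	/* list ordered oldest -> newest */
	bool peer_valid;	/* set once the peer proves access to an export */
};

/* Only connections whose peer has not been validated count toward the limit. */
static size_t count_unvalidated(const struct conn *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		if (!head->peer_valid)
			n++;
	return n;
}
```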
2025-01-06 | nfsd: fix legacy client tracking initialization | Scott Mayhew
Get rid of the nfsd4_legacy_tracking_ops->init() call in check_for_legacy_methods(). That will be handled in the caller (nfsd4_client_tracking_init()). Otherwise, we'll wind up calling nfsd4_legacy_tracking_ops->init() twice, and the second time we'll trigger the BUG_ON() in nfsd4_init_recdir(). Fixes: 74fd48739d04 ("nfsd: new Kconfig option for legacy client tracking") Reported-by: Jur van der Burg <jur@avtware.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219580 Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 | NFSD: Clean up unused variable | Chuck Lever
@sb should have been removed by commit 7e64c5bc497c ("NLM/NFSD: Fix lock notifications for async-capable filesystems"). Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 | nfsd: use new wake_up_var interfaces. | NeilBrown
The wake_up_var interface is fragile as barriers are sometimes needed. There are now new interfaces so that most wake-ups can use an interface that is guaranteed to have all barriers needed. This patch changes the wake up on cl_cb_inflight to use atomic_dec_and_wake_up(). It also changes the wake up on rp_locked to use store_release_wake_up(). This involves changing rp_locked from atomic_t to int. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 | nfsd: trace: remove redundant stateid event deleg_recall | Chen Hanxiao
Since commit e56dc9e2949e ("nfsd: remove fault injection code") removed all the nfsd_recall_delegations code, we don't need trace_nfsd_deleg_recall any more. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-06 | Merge tag 'exfat-for-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat | Linus Torvalds
Pull exfat fixes from Namjae Jeon: "All fixes are for issues reported by syzbot: - Fix wrong error return in exfat_find_empty_entry() - Fix an endless loop caused by a self-linked chain - Fix a KMSAN uninit-value issue in exfat_extend_valid_size()" * tag 'exfat-for-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix the infinite loop in __exfat_free_cluster() exfat: fix the new buffer was not zeroed before writing exfat: fix the infinite loop in exfat_readdir() exfat: fix exfat_find_empty_entry() not returning error on failure
2025-01-06 | btrfs: don't read from userspace twice in btrfs_uring_encoded_read() | Mark Harmstone
If we return -EAGAIN the first time because we need to block, btrfs_uring_encoded_read() will get called twice. Take a copy of args, the iovs, and the iter the first time, as by the time we are called the second time these may have gone out of scope. Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06 | fanotify: Fix crash in fanotify_init(2) | Jan Kara
The error handling in fanotify_init(2) is buggy and overwrites 'fd' before calling put_unused_fd(), leading to possible access beyond the end of the fd bitmap. Fix it. Reported-by: syzbot+6a3aa63412255587b21b@syzkaller.appspotmail.com Fixes: ebe559609d78 ("fs: get rid of __FMODE_NONOTIFY kludge") Signed-off-by: Jan Kara <jack@suse.cz>
2025-01-05 | ksmbd: Remove unneeded if check in ksmbd_rdma_capable_netdev() | Thorsten Blum
Remove the unnecessary if check and assign the result directly. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-05 | ksmbd: fix a missing return value check bug | Wentao Liang
In smb2_send_interim_resp(), if ksmbd_alloc_work_struct() fails to allocate a node, it leaves in_work as a NULL pointer. This can lead to an illegal memory write to in_work->response_buf when allocate_interim_rsp_buf() attempts to perform a kzalloc() on it. To address this issue, add a check for the return value of ksmbd_alloc_work_struct() so that the function returns immediately upon allocation failure, preventing the illegal memory access. Fixes: 041bba4414cd ("ksmbd: fix wrong interim response on compound") Signed-off-by: Wentao Liang <liangwentao@iscas.ac.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-06 | Merge tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds
Pull hotfixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) MAINTAINERS: change Arınç ÜNAL's name and email address scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity mm/util: make memdup_user_nul() similar to memdup_user() mm, madvise: fix potential workingset node list_lru leaks mm/damon/core: fix ignored quota goals and filters of newly committed schemes mm/damon/core: fix new damon_target objects leaks on damon_commit_targets() mm/list_lru: fix false warning of negative counter vmstat: disable vmstat_work on vmstat_cpu_down_prep() mm: shmem: fix the update of 'shmem_falloc->nr_unswapped' mm: shmem: fix incorrect index alignment for within_size policy percpu: remove intermediate variable in PERCPU_PTR() mm: zswap: fix race between [de]compression and CPU hotunplug ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit kcov: mark in_softirq_really() as __always_inline docs: mm: fix the incorrect 'FileHugeMapped' field mailmap: modify the entry for Mathieu Othacehe mm/kmemleak: fix sleeping function called from invalid context at print message mm: hugetlb: independent PMD page table shared count maple_tree: reload mas before the second call for mas_empty_area ...
2025-01-04 | libfs: Use d_children list to iterate simple_offset directories | Chuck Lever
The mtree mechanism has been effective at creating directory offsets that are stable over multiple opendir instances. However, it has not been able to handle the subtleties of renames that are concurrent with readdir. Instead of using the mtree to emit entries in the order of their offset values, use it only to map incoming ctx->pos to a starting entry. Then use the directory's d_children list, which is already maintained properly by the dcache, to find the next child to emit. One of the sneaky things about this is that when the mtree-allocated offset value wraps (which is very rare), looking up ctx->pos++ is not going to find the next entry; it will return NULL. Instead, by following the d_children list, the offset values can appear in any order but all of the entries in the directory will be visited eventually. Note also that the readdir() is guaranteed to reach the tail of this list. Entries are added only at the head of d_children, and readdir walks from its current position in that list towards its tail. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-6-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04 | libfs: Replace simple_offset end-of-directory detection | Chuck Lever
According to getdents(3), the d_off field in each returned directory entry points to the next entry in the directory. The d_off field in the last returned entry in the readdir buffer must contain a valid offset value, but if it points to an actual directory entry, then readdir/getdents can loop. This patch introduces a specific fixed offset value that is placed in the d_off field of the last entry in a directory. Some user space applications assume that the EOD offset value is larger than the offsets of real directory entries, so the largest valid offset value is reserved for this purpose. This new value is never allocated by simple_offset_add(). When ->iterate_dir() returns, getdents{64} inserts the ctx->pos value into the d_off field of the last valid entry in the readdir buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD offset value so the last entry is updated to point to the EOD marker. When trying to read the entry at the EOD offset, offset_readdir() terminates immediately. It is worth noting that using a Maple tree for directory offset value allocation does not guarantee a 63-bit range of values -- on platforms where "long" is a 32-bit type, the directory offset value range is still 0..(2^31 - 1). For broad compatibility with 32-bit user space, the largest tmpfs directory cookie value is now S32_MAX. Fixes: 796432efab1e ("libfs: getdents() should return 0 after reaching EOD") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-5-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04Revert "libfs: fix infinite directory reads for offset dir"Chuck Lever
The current directory offset allocator (based on mtree_alloc_cyclic) stores the next offset value to return in octx->next_offset. This mechanism typically returns values that increase monotonically over time. Eventually, though, the newly allocated offset value wraps back to a low number (say, 2) which is smaller than other already- allocated offset values. Yu Kuai <yukuai3@huawei.com> reports that, after commit 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir"), if a directory's offset allocator wraps, existing entries are no longer visible via readdir/getdents because offset_readdir() stops listing entries once an entry's offset is larger than octx->next_offset. These entries vanish persistently -- they can be looked up, but will never again appear in readdir(3) output. The reason for this is that the commit treats directory offsets as monotonically increasing integer values rather than opaque cookies, and introduces this comparison: if (dentry2offset(dentry) >= last_index) { On 64-bit platforms, the directory offset value upper bound is 2^63 - 1. Directory offsets will monotonically increase for millions of years without wrapping. On 32-bit platforms, however, LONG_MAX is 2^31 - 1. The allocator can wrap after only a few weeks (at worst). Revert commit 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir") to prepare for a fix that can work properly on 32-bit systems and might apply to recent LTS kernels where shmem employs the simple_offset mechanism. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-4-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04Revert "libfs: Add simple_offset_empty()"Chuck Lever
simple_empty() and simple_offset_empty() perform the same task. The latter's use as a canary to find bugs has not found any new issues. A subsequent patch will remove the use of the mtree for iterating directory contents, so revert back to using a similar mechanism for determining whether a directory is indeed empty. Only one such mechanism is ever needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-3-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04 | libfs: Return ENOSPC when the directory offset range is exhausted | Chuck Lever
Testing shows that the EBUSY error return from mtree_alloc_cyclic() leaks into user space. The ERRORS section of "man creat(2)" says: > EBUSY O_EXCL was specified in flags and pathname refers > to a block device that is in use by the system > (e.g., it is mounted). ENOSPC is closer to what applications expect in this situation. Note that the normal range of simple directory offset values is 2..2^63, so hitting this error is going to be rare to impossible. Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets") Cc: stable@vger.kernel.org # v6.9+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-2-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04 | Merge patch series "fix reading ESP during coredump" | Christian Brauner
Nam Cao <namcao@linutronix.de> says: In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value. However, at the moment, kstkesp is zero even during coredump. The first commit fixes this problem, and the second commit adds a selftest to detect if this problem appears again in the future. * patches from https://lore.kernel.org/r/cover.1735805772.git.namcao@linutronix.de: selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump Link: https://lore.kernel.org/r/cover.1735805772.git.namcao@linutronix.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04 | pipe_read: don't wake up the writer if the pipe is still full | Oleg Nesterov
wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after the reading, the writer sleeping in wait_event(wr_wait, pipe_writable()) will check the pipe_writable() == !pipe_full() condition and sleep again. Only wake the writer if we actually released a pipe buf, and the pipe was full before we did so. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/all/20241229135737.GA3293@redhat.com/ Link: https://lore.kernel.org/r/20250102140715.GA7091@redhat.com Reported-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
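A minimal C sketch of the new condition (simplified; the real code manipulates the pipe ring and wr_wait directly): waking the writer only makes sense if this read freed at least one buffer in a pipe that was full beforehand.

```c
#include <stdbool.h>

/* Decide whether a completed read should wake a writer sleeping on wr_wait. */
static bool should_wake_writer(bool was_full_before_read, bool released_a_buf)
{
	return was_full_before_read && released_a_buf;
}
```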
2025-01-04 | fs/proc: do_task_stat: Fix ESP not readable during coredump | Nam Cao
The field "eip" (instruction pointer) and "esp" (stack pointer) of a task can be read from /proc/PID/stat. These fields can be interesting for coredump. However, these fields were disabled by commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat"), because it is generally unsafe to do so. But it is safe for a coredumping process, and therefore exceptions were made: - for a coredumping thread by commit fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping"). - for all other threads in a coredumping process by commit cb8f381f1613 ("fs/proc/array.c: allow reporting eip/esp for all coredumping threads"). The above two commits check the PF_DUMPCORE flag to determine a coredump thread and the PF_EXITING flag for the other threads. Unfortunately, commit 92307383082d ("coredump: Don't perform any cleanups before dumping core") moved coredump to happen earlier and before PF_EXITING is set. Thus, checking PF_EXITING is no longer the correct way to determine threads in a coredumping process. Instead of PF_EXITING, use PF_POSTCOREDUMP to determine the other threads. Checking of PF_EXITING was added for coredumping, so it probably can now be removed. But it doesn't hurt to keep. Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core") Cc: stable@vger.kernel.org Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <kees@kernel.org> Signed-off-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/d89af63d478d6c64cc46a01420b46fd6eb147d6f.1735805772.git.namcao@linutronix.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-04 | fuse: respect FOPEN_KEEP_CACHE on opendir | Amir Goldstein
The re-factoring of fuse_dir_open() missed the need to invalidate directory inode page cache with open flag FOPEN_KEEP_CACHE. Fixes: 7de64d521bf92 ("fuse: break up fuse_open_common()") Reported-by: Prince Kumar <princer@google.com> Closes: https://lore.kernel.org/linux-fsdevel/CAEW=TRr7CYb4LtsvQPLj-zx5Y+EYBmGfM24SuzwyDoGVNoKm7w@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250101130037.96680-1-amir73il@gmail.com Reviewed-by: Bernd Schubert <bernd.schubert@fastmail.fm> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-31 | exfat: fix the infinite loop in __exfat_free_cluster() | Yuezhang Mo
In __exfat_free_cluster(), the cluster chain is traversed until the EOF cluster. If the cluster chain includes a loop due to file system corruption, the EOF cluster cannot be traversed, resulting in an infinite loop. This commit uses the total number of clusters to prevent this infinite loop. Reported-by: syzbot+1de5a37cb85a2d536330@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330 Tested-by: syzbot+1de5a37cb85a2d536330@syzkaller.appspotmail.com Fixes: 31023864e67a ("exfat: add fat entry operations") Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-12-31 | exfat: fix the new buffer was not zeroed before writing | Yuezhang Mo
Before writing, if a buffer_head is marked as new, its data must be zeroed, otherwise uninitialized data in the page cache will be written. So this commit uses folio_zero_new_buffers() to zero the new buffers before ->write_end(). Fixes: 6630ea49103c ("exfat: move extend valid_size into ->page_mkwrite()") Reported-by: syzbot+91ae49e1c1a2634d20c0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=91ae49e1c1a2634d20c0 Tested-by: syzbot+91ae49e1c1a2634d20c0@syzkaller.appspotmail.com Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-12-31 | exfat: fix the infinite loop in exfat_readdir() | Yuezhang Mo
If the file system is corrupted so that a cluster is linked to itself in the cluster chain, and there is an unused directory entry in the cluster, 'dentry' will not be incremented, so the condition 'dentry < max_dentries' cannot prevent an infinite loop. This infinite loop causes s_lock not to be released, and other tasks will hang, such as exfat_sync_fs(). This commit stops traversing the cluster chain when there is an unused directory entry in the cluster to avoid this infinite loop. Reported-by: syzbot+205c2644abdff9d3f9fc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=205c2644abdff9d3f9fc Tested-by: syzbot+205c2644abdff9d3f9fc@syzkaller.appspotmail.com Fixes: ca06197382bd ("exfat: add directory operations") Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2024-12-30 | ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv | Dennis Lam
When mounting ocfs2 and then remounting it as read-only, a slab-use-after-free occurs after the user issues a quota_getnextquota syscall. Specifically, sb_dqinfo(sb, type)->dqi_priv is the dangling pointer. During the remounting process, the pointer dqi_priv is freed but is never set to null, leaving it to be accessed. Additionally, the read-only option for remounting sets the DQUOT_SUSPENDED flag instead of setting the DQUOT_USAGE_ENABLED flag. Moreover, later in the process of getting the next quota, the function ocfs2_get_next_id is called and only checks the quota usage flags and not the quota suspended flags. To fix this, set dqi_priv to null when it is freed after remounting with read-only, and add a check for DQUOT_SUSPENDED in ocfs2_get_next_id. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20241218023924.22821-2-dennis.lamerice@gmail.com Fixes: 8f9e8f5fcc05 ("ocfs2: Fix Q_GETNEXTQUOTA for filesystem without quotas") Signed-off-by: Dennis Lam <dennis.lamerice@gmail.com> Reported-by: syzbot+d173bf8a5a7faeede34c@syzkaller.appspotmail.com Tested-by: syzbot+d173bf8a5a7faeede34c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6731d26f.050a0220.1fb99c.014b.GAE@google.com/T/ Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30 | fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit | David Hildenbrand
Entries (including flags) are u64, even on 32-bit. So right now we are cutting off the flags on 32-bit. This way, for example, the cow selftest complains about: # ./cow ... Bail Out! read and ioctl return unmatched results for populated: 0 1 Link: https://lkml.kernel.org/r/20241217195000.1734039-1-david@redhat.com Fixes: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30 | fs/ntfs3: Unify inode corruption marking with _ntfs_bad_inode() | Konstantin Komarov
Also reworked error handling in a couple of places. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-12-30 | fs/ntfs3: Mark inode as bad as soon as error detected in mi_enum_attr() | Konstantin Komarov
Extended the `mi_enum_attr()` function interface with an additional parameter, `struct ntfs_inode *ni`, to allow marking the inode as bad as soon as an error is detected. Reported-by: syzbot+73d8fc29ec7cba8286fa@syzkaller.appspotmail.com Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-12-29 | bcachefs: bcachefs_metadata_version_inode_depth | Kent Overstreet
This adds a new inode field, bi_depth, for directory inodes: this allows us to make the check_directory_structure pass much more efficient. Currently, to ensure the filesystem is fully connected and has no loops, for every directory we follow backpointers until we find the root. But by adding a depth counter, it suffices to check only the parent of each directory, and check that the parent's bi_depth is smaller. (fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename causes bi_depth to be off, but the chain to the root is still strictly decreasing, then the algorithm still works and there's no need for fsck to fix up the bi_depth fields). We've already checked backpointers, so we know that every directory (excluding the root) has a valid parent: if bi_depth is always decreasing, every chain must terminate, and terminate at the root directory. bi_depth will not necessarily be correct when fsck runs, due to directory renames - we can't change bi_depth on every child directory when renaming a directory. That's ok; fsck will silently fix the bi_depth field as needed, and future fsck runs will be much faster. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
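A userspace C model of the connectivity argument above: if every directory's depth is strictly greater than its parent's, every parent chain is strictly decreasing and must terminate at the root, so checking only the immediate parent rules out loops. The names here are illustrative, not the on-disk fields.

```c
#include <stdbool.h>
#include <stddef.h>

struct dir {
	struct dir *parent;	/* NULL only for the root directory */
	unsigned int depth;	/* plays the role of bi_depth */
};

/* Cheap per-directory check: only the immediate parent is consulted. */
static bool depth_ok(const struct dir *d)
{
	return d->parent == NULL || d->parent->depth < d->depth;
}
```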
2024-12-29 | bcachefs: Option changes now get propagated to reflinked data | Kent Overstreet
Now that bch2_move_get_io_opts() re-propagates changed inode io options to bch_extent_rebalance, we can properly support changing IO path options for reflinked data. Changing a per-file IO path option, either via the xattr interface or via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the inode number is marked as needing a scan, via bch2_set_rebalance_needs_scan()), and rebalance will use bch2_move_data(), which will walk the inode number and pick up the new options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts | Kent Overstreet
Previously, io path option changes on a file would be picked up automatically and applied to existing data - but not for reflinked data, as we had no way of doing this safely. A user may have had permission to copy (and reflink) a given file, but not write to it, and if so they shouldn't be allowed to change e.g. nr_replicas or other options. This uses the incompat feature mechanism in the previous patch to add a new incompatible flag to bch_reflink_p, indicating whether a given reflink pointer may propagate io path option changes back to the indirect extent. In this initial patch we're only setting it for the source extents. We'd like to set it for the destination in a reflink copy, when the user has write access to the source, but that requires mnt_idmap which is not currently plumbed up to remap_file_range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: BCH_SB_VERSION_INCOMPAT | Kent Overstreet
We've been getting away from feature bits: they don't have any kind of ordering, and thus it's possible for people to enable weird combinations of features that were never tested or intended to be run. Much better to just give every new feature, compatible or incompatible, a version number. Additionally, we probably won't ever rev the major version number: major version numbers represent incompatible versions, but that doesn't really fit with how we actually roll out incompatible features - we need a better way of rolling out incompatible features. So, this patch adds two new superblock fields: - BCH_SB_VERSION_INCOMPAT - BCH_SB_VERSION_INCOMPAT_ALLOWED BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up to version number x are allowed to be used without user prompting, but it does not by itself deny old versions from mounting. BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must be <= BCH_SB_VERSION_INCOMPAT_ALLOWED. BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use an incompatible feature, so as to not unnecessarily break compatibility with old versions. bch2_request_incompat_feature() is the new interface to check if an incompatible feature may be used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: Only run check_backpointers_to_extents in debug mode | Kent Overstreet
The backpointers passes, check_backpointers_to_extents() and check_extents_to_backpointers() are the most expensive fsck passes. Now that we're running the same check and repair code when using a backpointer at runtime (via bch2_backpointer_get_key()) that fsck does, there's no reason fsck needs to - except to verify that the filesystem really has no errors in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: better backpointer_target_not_found() error message | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers | Kent Overstreet
Continuing on with the self-healing theme, we should be running any check and repair code at runtime that we can - instead of declaring the filesystem inconsistent. This will also let us skip running the backpointers -> extents fsck pass except in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches | Kent Overstreet
Instead of walking every extent and every backpointer it points to, first sum up backpointers in each bucket and check for mismatches, and only look for missing backpointers if mismatches were detected, and only check extents in those buckets. This is a major fsck scalability improvement, since the two backpointers passes (backpointers -> extents and extents -> backpointers) are the most expensive fsck passes by far. Additionally, to speed up the upgrade for backpointer bucket gens, or in situations when we have to rebuild alloc info, add a special case for when no backpointers are found in a bucket - don't check each individual backpointer (in particular, avoiding the write buffer flushes), just recreate them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: Add write buffer flush param to backpointer_get_key() | Kent Overstreet
In an upcoming patch bch2_backpointer_get_key() will be repairing when it finds a dangling backpointer; it will need to flush the btree write buffer before it can definitively say there's an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: kill __bch2_extent_ptr_to_bp() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: bch2_extent_ptr_to_bp() no longer depends on device | Kent Overstreet
bch_backpointer no longer contains the bucket_offset field, it's just a direct LBA mapping (with low bits to account for compressed extent splitting), so we don't need to refer to the device to construct it anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: bcachefs_metadata_version_disk_accounting_big_endian | Kent Overstreet
Fix sort order for disk accounting keys, in order to fix a regression on mount times. The typetag is now the most significant byte of the key, meaning disk accounting keys of the same type now sort together. This lets us skip over disk accounting keys that aren't mirrored in memory when reading accounting at startup, instead of having them interleaved with other counter types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
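A tiny C illustration of why placing the type tag in the most significant byte groups keys of the same accounting type together; the 56-bit payload layout here is invented for the example, not the actual bcachefs key encoding.

```c
#include <stdint.h>

/* Build a sort key whose most significant byte is the type tag. */
static uint64_t accounting_sort_key(uint8_t typetag, uint64_t payload)
{
	return ((uint64_t)typetag << 56) | (payload & ((UINT64_C(1) << 56) - 1));
}
```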
2024-12-29 | bcachefs: bcachefs_metadata_version_backpointer_bucket_gen | Kent Overstreet
New on-disk format version: backpointers now include the generation number of the bucket they refer to, and the obsolete bucket_offset field (no longer needed because we no longer store backpointers in alloc keys) is gone. This is an expensive forced upgrade - hopefully the last; we have to run the extents_to_backpointers recovery pass to regenerate backpointers. It's a forced incompatible upgrade because the alternative would've been permanently making backpointers bigger, and as one of the biggest btrees (along with the extents btree) that's not an ideal option. It's worth it though, because this allows us to make the check_extents_to_backpointers pass drastically cheaper: an upcoming patch changes it to sum up backpointers in a bucket and check the sum against the sector counts for that bucket, only looking for missing backpointers if they don't match (and then only for specific buckets). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: bch2_btree_path_peek_slot() doesn't return errors | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: trace_key_cache_fill | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: Log message in journal for snapshot deletion | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 | bcachefs: bch2_trans_log_msg() | Kent Overstreet
Export a helper for logging to the journal when we're already in a transaction context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>