summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-03-10bcachefs: fix split brain messageKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: Set path->uptodate when no node at levelKent Overstreet
We were failing to set path->uptodate when reaching the end of a btree node iterator, causing the new prefetch code for backpointers gc to go into an infinite loop. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: Correctly validate k->u64s in btree node read pathKent Overstreet
validate_bset_keys() never properly validated k->u64s; it checked if it was 0, but not if it was smaller than keys for the given packed format; this fixes that small oversight. This patch was backported, so it's adding quite a few error enums so that they don't get renumbered and we don't have confusing gaps. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: Fix degraded mode fsckKent Overstreet
We don't know where the superblock and journal lives on offline devices; that means if a device is offline fsck can't check those buckets. Previously, fsck would incorrectly clear bucket data types for those buckets on offline devices; now we just use the previous state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: Fix journal replay with unreadable btree rootsKent Overstreet
When a btree root is unreadable, we still might be able to get some data back by replaying what's in the journal. Previously though, we got confused when journal replay would attempt to replay a key for a level that didn't exist. This adds bch2_btree_increase_depth(), so that journal replay can handle this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: fix check_inode_deleted_list()Kent Overstreet
check_inode_deleted_list() returns true if the inode is on the deleted list; check_inode() was checking the return code incorrectly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: no_splitbrain_check optionKent Overstreet
This adds an option to disable kicking out devices when splitbrain is detected - it seems there's some issues with splitbrain detection and we're kicking out devices erronously. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: extent_entry_next_safe()Kent Overstreet
We need to be able to iterate over extent ptrs that may be corrupted in order to print them - this fixes a bug where we'd pop an assert in bch2_bkey_durability_safe(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: journal_seq_blacklist_add() now handles entries being added out of ↵Kent Overstreet
order bch2_journal_seq_blacklist_add() was bugged when the new entry overlapped with multiple existing entries, and it also assumed new entries are being added in increasing order. This is true on any sane filesystem, but when trying to recover from very badly mangled filesystems we might end up with the journal sequence number rewinding vs. what the blacklist list knows about - easiest to just handle that here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: Fix null-ptr-deref in bch2_fs_alloc()Li Zetao
There is a null-ptr-deref issue reported by kasan: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] Call Trace: <TASK> bch2_fs_alloc+0x1092/0x2170 [bcachefs] bch2_fs_open+0x683/0xe10 [bcachefs] ... When initializing the name of bch_fs, it needs to dynamically alloc memory to meet the length of the name. However, when name allocation failed, it will cause a null-ptr-deref access exception in subsequent string copy. Fix this issue by checking if name allocation is successful. Fixes: 401ec4db6308 ("bcachefs: Printbuf rework") Signed-off-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10erofs: support compressed inodes over fscacheJingbo Xu
Since fscache can utilize iov_iter to write dest buffers, bio_vec can be used in this way too. To simplify this, pseudo bios are prepared and bio_vec will be filled with bio_add_page(). And a common .bi_end_io will be called directly to handle I/O completions. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240308094159.40547-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: make iov_iter describe target buffers over fscacheJingbo Xu
So far the fscache mode supports uncompressed data only, and the data read from fscache is put directly into the target page cache. As the support for compressed data in fscache mode is going to be introduced, rework the fscache internals so that the following compressed part could make the raw data read from fscache be directed to the target buffer it wants, decompress the raw data, and finally fill the page cache with the decompressed data. As the first step, a new structure, i.e. erofs_fscache_io (io), is introduced to describe a generic read request from the fscache, while the caller can specify the target buffer it wants in the iov_iter structure (io->iter). Besides, the caller can also specify its completion callback and private data through erofs_fscache_io, which will be called to make further handling, e.g. unlocking the page cache for uncompressed data or decompressing the read raw data, when the read request from the fscache completes. Now erofs_fscache_read_io_async() serves as a generic interface for reading raw data from fscache for both compressed and uncompressed data. The erofs_fscache_rq structure is kept to describe a request to fill the page cache in the specified range. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240308094159.40547-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: fix lockdep false positives on initializing erofs_pseudo_mntBaokun Li
Lockdep reported the following issue when mounting erofs with a domain_id: ============================================ WARNING: possible recursive locking detected 6.8.0-rc7-xfstests #521 Not tainted -------------------------------------------- mount/396 is trying to acquire lock: ffff907a8aaaa0e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 but task is already holding lock: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&type->s_umount_key#50/1); lock(&type->s_umount_key#50/1); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by mount/396: #0: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 #1: ffffffffc00e6f28 (erofs_domain_list_lock){+.+.}-{3:3}, at: erofs_fscache_register_fs+0x3d/0x270 [erofs] stack backtrace: CPU: 1 PID: 396 Comm: mount Not tainted 6.8.0-rc7-xfstests #521 Call Trace: <TASK> dump_stack_lvl+0x64/0xb0 validate_chain+0x5c4/0xa00 __lock_acquire+0x6a9/0xd50 lock_acquire+0xcd/0x2b0 down_write_nested+0x45/0xd0 alloc_super+0xe3/0x3d0 sget_fc+0x62/0x2f0 vfs_get_super+0x21/0x90 vfs_get_tree+0x2c/0xf0 fc_mount+0x12/0x40 vfs_kern_mount.part.0+0x75/0x90 kern_mount+0x24/0x40 erofs_fscache_register_fs+0x1ef/0x270 [erofs] erofs_fc_fill_super+0x213/0x380 [erofs] This is because the file_system_type of both erofs and the pseudo-mount point of domain_id is erofs_fs_type, so two successive calls to alloc_super() are considered to be using the same lock and trigger the warning above. Therefore add a nodev file_system_type called erofs_anon_fs_type in fscache.c to silence this complaint. Because kern_mount() takes a pointer to struct file_system_type, not its (string) name. So we don't need to call register_filesystem(). In addition, call init_pseudo() in erofs_anon_init_fs_context() as suggested by Al Viro, so that we can remove erofs_fc_fill_pseudo_super(), erofs_fc_anon_get_tree(), and erofs_anon_context_ops. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: a9849560c55e ("erofs: introduce a pseudo mnt to manage shared cookies") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-and-tested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20240307101018.2021925-1-libaokun1@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: refine managed cache operations to foliosGao Xiang
Convert erofs_try_to_free_all_cached_pages() and z_erofs_cache_release_folio(). Besides, erofs_page_is_managed() is moved to zdata.c and renamed as erofs_folio_is_managed(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-6-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_submissionqueue_endio() to foliosGao Xiang
Use bio_for_each_folio() to iterate over each folio in the bio and there is no large folios for now. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-5-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_fill_bio_vec() to foliosGao Xiang
Introduce a folio member to `struct z_erofs_bvec` and convert most of z_erofs_fill_bio_vec() to folios, which is still straight-forward. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-4-hsiangkao@linux.alibaba.com
2024-03-10erofs: get rid of `justfound` debugging tagGao Xiang
`justfound` is introduced to identify cached folios that are just added to compressed bvecs so that more checks can be applied in the I/O submission path. EROFS is quite now stable compared to the codebase at that stage. `justfound` becomes a burden for upcoming features. Drop it. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-3-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_do_read_page() to foliosGao Xiang
It is a straight-forward conversion. Besides, it's renamed as z_erofs_scan_folio(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-2-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_onlinepage_.* to foliosGao Xiang
Online folios are locked file-backed folios which will eventually keep decoded (e.g. decompressed) data of each inode for end users to utilize. It may belong to a few pclusters and contain other data (e.g. compressed data for inplace I/Os) temporarily in a time-sharing manner to reduce memory footprints for low-ended storage devices with high latencies under heary I/O pressure. Apart from folio_end_read() usage, it's a straight-forward conversion. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-1-hsiangkao@linux.alibaba.com
2024-03-09exec: Simplify remove_arg_zero() error pathKees Cook
We don't need the "out" label any more, so remove "ret" and return directly on error. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Kees Cook <keescook@chromium.org> --- Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: linux-mm@kvack.org Cc: linux-fsdevel@vger.kernel.org
2024-03-09pstore/zone: Don't clear memory twiceChristophe JAILLET
There is no need to call memset(..., 0, ...) on memory allocated by kcalloc(). It is already zeroed. Remove the redundant call. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/fa2597400051c18c6ca11187b0e4b906729991b2.1709972649.git.christophe.jaillet@wanadoo.fr Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-09NFSD: Clean up nfsd4_encode_replay()Chuck Lever
Replace open-coded encoding logic with the use of conventional XDR utility functions. Add a tracepoint to make replays observable in field troubleshooting situations. The WARN_ON is removed. A stack trace is of little use, as there is only one call site for nfsd4_encode_replay(), and a buffer length shortage here is unlikely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-09NFS: trace the uniquifier of fscacheChen Hanxiao
Trace the mount option fsc=xxx. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: Read unlock folio on nfs_page_create_from_folio() errorBenjamin Coddington
The netfs conversion lost a folio_unlock() for the case where nfs_page_create_from_folio() returns an error (usually -ENOMEM). Restore it. Reported-by: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> # 6.4+ Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Acked-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: remove unused variable nfs_rpcstatTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: fix UAF in direct writesJosef Bacik
In production we have been hitting the following warning consistently ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30 This is because we're completing the nfs_direct_request twice in a row. The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq); However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice. The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a nfs_commit_begin(); nfs_commit_end(); Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths. Fix this by using the same pattern for the commit requests. Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: properly protect nfs_direct_req fieldsJosef Bacik
We protect accesses to the nfs_direct_req fields with the dreq->lock ever where except nfs_direct_commit_complete. This isn't a huge deal, but it does lead to confusion, and we could potentially end up setting NFS_ODIRECT_RESCHED_WRITES in one thread where we've had an error in another. Clean this up to properly protect ->error and ->flags in the commit completion path. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: enable nconnect for RDMATrond Myklebust
It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFSv4: nfs4_do_open() is incorrectly triggering state recoveryTrond Myklebust
We're seeing spurious calls to nfs4_schedule_stateid_recovery() from nfs4_do_open() in situations where there is no trigger coming from the server. In theory the code path being triggered is supposed to notice that state recovery happened while we were processing the open call result from the server, before the open stateid is published. However in the years since that code was added, we've also added the 'session draining' mechanism, which ensures that the state recovery will wait until all the session slots have been returned. In nfs4_do_open() the session slot is only returned on exit of the function, so we don't need the legacy mechanism. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: avoid infinite loop in pnfs_update_layout.NeilBrown
If pnfsd_update_layout() is called on a file for which recovery has failed it will enter a tight infinite loop. NFS_LAYOUT_INVALID_STID will be set, nfs4_select_rw_stateid() will return -EIO, and nfs4_schedule_stateid_recovery() will do nothing, so nfs4_client_recover_expired_lease() will not wait. So the code will loop indefinitely. Break the loop by testing the validity of the open stateid at the top of the loop. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: remove sync_mode test from nfs_writepage_locked()NeilBrown
nfs_writepage_locked() is only called from nfs_wb_folio() (since Commit 12fc0a963128 ("nfs: Remove writepage")) so ->sync_mode is always WB_SYNC_ALL. This means the test for WB_SYNC_NONE is dead code and can be removed. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFSv4.1/pnfs: fix NFS with TLS in pnfsOlga Kornievskaia
Currently, even though xprtsec=tls is specified and used for operations to MDS, any operations that go to DS travel over unencrypted connection. Or additionally, if more than 1 DS can serve the data, then trunked connections are also done unencrypted. IN GETDEVINCEINFO, we get an entry for the DS which carries a protocol type (which is TCP), then nfs4_set_ds_client() gets called with TCP instead of TCP with TLS. Currently, each trunked connection is created and uses clp->cl_hostname value which if TLS is used would get passed up in the handshake upcall, but instead we need to pass in the appropriate trunked address value. Fixes: c8407f2e560c ("NFS: Add an "xprtsec=" NFS mount option") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: Fix an off by one in root_nfs_cat()Christophe JAILLET
The intent is to check if 'dest' is truncated or not. So, >= should be used instead of >, because strlcat() returns the length of 'dest' and 'src' excluding the trailing NULL. Fixes: 56463e50d1fc ("NFS: Use super.c for NFSROOT mount option parsing") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: make the rpc_stat per net namespaceJosef Bacik
Now that we're exposing the rpc stats on a per-network namespace basis, move this struct into struct nfs_net and use that to make sure only the per-network namespace stats are exposed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: expose /proc/net/sunrpc/nfs in net namespacesJosef Bacik
We're using nfs mounts inside of containers in production and noticed that the nfs stats are not exposed in /proc. This is a problem for us as we use these stats for monitoring, and have to do this awkward bind mount from the main host into the container in order to get to these states. Add the rpc_proc_register call to the pernet operations entry and exit points so these stats can be exposed inside of network namespaces. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFSv4.1: add tracepoint to trunked nfs4_exchange_id callsOlga Kornievskaia
Add a tracepoint to track when the client sends EXCHANGE_ID to test a new transport for session trunking. nfs4_detect_session_trunking() tests for trunking and returns EINVAL if trunking can't be done, add EINVAL mapping to show_nfs4_status() in tracepoints. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interruptDave Wysochanski
The loop inside nfs_netfs_issue_read() currently does not disable interrupts while iterating through pages in the xarray to submit for NFS read. This is not safe though since after taking xa_lock, another page in the mapping could be processed for writeback inside an interrupt, and deadlock can occur. The fix is simple and clean if we use xa_for_each_range(), which handles the iteration with RCU while reducing code complexity. The problem is easily reproduced with the following test: mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 echo 3 > /proc/sys/vm/drop_caches dd if=/mnt/nfs/file1.bin of=/dev/null umount /mnt/nfs On the console with a lockdep-enabled kernel a message similar to the following will be seen: ================================ WARNING: inconsistent lock state 6.7.0-lockdbg+ #10 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at: nfs_netfs_issue_read+0x1b2/0x4b0 [nfs] {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x144/0x380 _raw_spin_lock_irqsave+0x4e/0xa0 __folio_end_writeback+0x17e/0x5c0 folio_end_writeback+0x93/0x1b0 iomap_finish_ioend+0xeb/0x6a0 blk_update_request+0x204/0x7f0 blk_mq_end_request+0x30/0x1c0 blk_complete_reqs+0x7e/0xa0 __do_softirq+0x113/0x544 __irq_exit_rcu+0xfe/0x120 irq_exit_rcu+0xe/0x20 sysvec_call_function_single+0x6f/0x90 asm_sysvec_call_function_single+0x1a/0x20 pv_native_safe_halt+0xf/0x20 default_idle+0x9/0x20 default_idle_call+0x67/0xa0 do_idle+0x2b5/0x300 cpu_startup_entry+0x34/0x40 start_secondary+0x19d/0x1c0 secondary_startup_64_no_verify+0x18f/0x19b irq event stamp: 176891 hardirqs last enabled at (176891): [<ffffffffa67a0be4>] _raw_spin_unlock_irqrestore+0x44/0x60 hardirqs last disabled at (176890): [<ffffffffa67a0899>] _raw_spin_lock_irqsave+0x79/0xa0 softirqs last enabled at (176646): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120 softirqs last disabled at (176633): [<ffffffffa515d91e>] __irq_exit_rcu+0xfe/0x120 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&xa->xa_lock#4); <Interrupt> lock(&xa->xa_lock#4); *** DEADLOCK *** 2 locks held by test5/1708: #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at: nfs_start_io_read+0x28/0x90 [nfs] #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at: page_cache_ra_unbounded+0xa4/0x280 stack backtrace: CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+ Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 Call Trace: dump_stack_lvl+0x5b/0x90 mark_lock+0xb3f/0xd20 __lock_acquire+0x77b/0x3360 _raw_spin_lock+0x34/0x80 nfs_netfs_issue_read+0x1b2/0x4b0 [nfs] netfs_begin_read+0x77f/0x980 [netfs] nfs_netfs_readahead+0x45/0x60 [nfs] nfs_readahead+0x323/0x5a0 [nfs] read_pages+0xf3/0x5c0 page_cache_ra_unbounded+0x1c8/0x280 filemap_get_pages+0x38c/0xae0 filemap_read+0x206/0x5e0 nfs_file_read+0xb7/0x140 [nfs] vfs_read+0x2a9/0x460 ksys_read+0xb7/0x140 Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-07debugfs: fix wait/cancellation handling during removeJohannes Berg
Ben Greear further reports deadlocks during concurrent debugfs remove while files are being accessed, even though the code in question now uses debugfs cancellations. Turns out that despite all the review on the locking, we missed completely that the logic is wrong: if the refcount hits zero we can finish (and need not wait for the completion), but if it doesn't we have to trigger all the cancellations. As written, we can _never_ get into the loop triggering the cancellations. Fix this, and explain it better while at it. Cc: stable@vger.kernel.org Fixes: 8c88a474357e ("debugfs: add API to allow debugfs operations cancellation") Reported-by: Ben Greear <greearb@candelatech.com> Closes: https://lore.kernel.org/r/1c9fa9e5-09f1-0522-fdbc-dbcef4d255ca@candelatech.com Tested-by: Madhan Sai <madhan.singaraju@candelatech.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20240229153635.6bfab7eb34d3.I6c7aeff8c9d6628a8bc1ddcf332205a49d801f17@changeid Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-07sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group.Rohan Kollambalath
These functions take a struct attribute_group as an input which has an optional .name field. These functions rely on the .name field being populated and do not check if its null. They pass this name into other functions, eventually leading to a null pointer dereference. This change simply updates the documentation of the function to make this requirement clear. Signed-off-by: Rohan Kollambalath <rkollamb@digi.com> Link: https://lore.kernel.org/r/20240211223634.2103665-1-rohankollambalath@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-07ext4: initialize sbi->s_freeclusters_counter and ↵Kemeng Shi
sbi->s_dirtyclusters_counter before use in kunit test Fix that sbi->s_freeclusters_counter and sbi->s_dirtyclusters_counter are used before initialization. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240304163543.6700-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: hold group lock in ext4 kunit testKemeng Shi
Although there is no concurrent block allocation/free in unit test, internal functions mb_mark_used and mb_free_blocks assert group lock is always held. Acquire group before calling mb_mark_used and mb_free_blocks in unit test to avoid the assertion. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240304163543.6700-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: alloc test super block from sgetKemeng Shi
This fix the oops in ext4 unit test which is cuased by NULL sb.s_user_ns as following: <4>[ 14.344565] map_id_range_down (kernel/user_namespace.c:318) <4>[ 14.345378] make_kuid (kernel/user_namespace.c:415) <4>[ 14.345998] inode_init_always (include/linux/fs.h:1375 fs/inode.c:174) <4>[ 14.346696] alloc_inode (fs/inode.c:268) <4>[ 14.347353] new_inode_pseudo (fs/inode.c:1007) <4>[ 14.348016] new_inode (fs/inode.c:1033) <4>[ 14.348644] ext4_mb_init (fs/ext4/mballoc.c:3404 fs/ext4/mballoc.c:3719) <4>[ 14.349312] mbt_kunit_init (fs/ext4/mballoc-test.c:57 fs/ext4/mballoc-test.c:314) <4>[ 14.349983] kunit_try_run_case (lib/kunit/test.c:388 lib/kunit/test.c:443) <4>[ 14.350696] kunit_generic_run_threadfn_adapter (lib/kunit/try-catch.c:30) <4>[ 14.351530] kthread (kernel/kthread.c:388) <4>[ 14.352168] ret_from_fork (arch/arm64/kernel/entry.S:861) <0>[ 14.353385] Code: 52808004 b8236ae7 72be5e44 b90004c4 (38e368a1) Alloc test super block from sget to properly initialize test super block to fix the issue. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240304163543.6700-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: kunit: use dynamic inode allocationArnd Bergmann
Storing an inode structure on the stack pushes some functions over the warning limit for stack frame size: In file included from fs/ext4/mballoc.c:7039: fs/ext4/mballoc-test.c:506:13: error: stack frame size (1032) exceeds limit (1024) in 'test_mark_diskspace_used' [-Werror,-Wframe-larger-than] 506 | static void test_mark_diskspace_used(struct kunit *test) | ^ Use kunit_kzalloc() for all inodes. There may be a better way to do it by preallocating the inode, which would result in a larger rework. Fixes: 2b81493f8eb6 ("ext4: Add unit test for ext4_mb_mark_diskspace_used") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240227161548.2929881-1-arnd@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: enable meta_bg only when new desc blocks are neededSrivathsa Dara
This patch addresses an issue observed when resize_inode is disabled and an online extension of a filesysyem is performed. When a filesystem is expanded to a size that does not require a addition of a new descriptor block, the meta_bg feature is being enabled even though no part of the filesystem uses this layout. This patch ensures that the meta_bg feature is only enabled if any of the added block groups utilize meta_bg layout. Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com> Link: https://lore.kernel.org/r/20240227131329.2608466-1-srivathsa.d.dara@oracle.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: remove unused parameter biop in ext4_issue_discard()Wenchao Hao
all caller of ext4_issue_discard() would set biop to NULL since 'commit 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex")', it's unnecessary to keep this parameter any more. Signed-off-by: Wenchao Hao <haowenchao2@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240226081731.3224470-1-haowenchao2@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: remove SLAB_MEM_SPREAD flag usageChengming Zhou
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20240224134822.829456-1-chengming.zhou@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: verify s_clusters_per_group even without bigallocJan Kara
Currently we ignore s_clusters_per_group field in the on-disk superblock if bigalloc feature is not enabled. However e2fsprogs don't even open the filesystem if s_clusters_per_group is invalid. This results in an odd state where kernel happily works with the filesystem while even e2fsck refuses to touch it. Verify that s_clusters_per_group is valid even if bigalloc feature is not enabled to make things consistent. Due to current e2fsprogs behavior it is unlikely there are filesystems out in the wild (except for intentionally fuzzed ones) with invalid s_clusters_per_group counts. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240219171033.22882-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: fix corruption during on-line resizeMaximilian Heyne
We observed a corruption during on-line resize of a file system that is larger than 16 TiB with 4k block size. With having more then 2^32 blocks resize_inode is turned off by default by mke2fs. The issue can be reproduced on a smaller file system for convenience by explicitly turning off resize_inode. An on-line resize across an 8 GiB boundary (the size of a meta block group in this setup) then leads to a corruption: dev=/dev/<some_dev> # should be >= 16 GiB mkdir -p /corruption /sbin/mke2fs -t ext4 -b 4096 -O ^resize_inode $dev $((2 * 2**21 - 2**15)) mount -t ext4 $dev /corruption dd if=/dev/zero bs=4096 of=/corruption/test count=$((2*2**21 - 4*2**15)) sha1sum /corruption/test # 79d2658b39dcfd77274e435b0934028adafaab11 /corruption/test /sbin/resize2fs $dev $((2*2**21)) # drop page cache to force reload the block from disk echo 1 > /proc/sys/vm/drop_caches sha1sum /corruption/test # 3c2abc63cbf1a94c9e6977e0fbd72cd832c4d5c3 /corruption/test 2^21 = 2^15*2^6 equals 8 GiB whereof 2^15 is the number of blocks per block group and 2^6 are the number of block groups that make a meta block group. The last checksum might be different depending on how the file is laid out across the physical blocks. The actual corruption occurs at physical block 63*2^15 = 2064384 which would be the location of the backup of the meta block group's block descriptor. During the on-line resize the file system will be converted to meta_bg starting at s_first_meta_bg which is 2 in the example - meaning all block groups after 16 GiB. However, in ext4_flex_group_add we might add block groups that are not part of the first meta block group yet. In the reproducer we achieved this by substracting the size of a whole block group from the point where the meta block group would start. This must be considered when updating the backup block group descriptors to follow the non-meta_bg layout. The fix is to add a test whether the group to add is already part of the meta block group or not. Fixes: 01f795f9e0d67 ("ext4: add online resizing support for meta_bg and 64-bit file systems") Cc: <stable@vger.kernel.org> Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Tested-by: Srivathsa Dara <srivathsa.d.dara@oracle.com> Reviewed-by: Srivathsa Dara <srivathsa.d.dara@oracle.com> Link: https://lore.kernel.org/r/20240215155009.94493-1-mheyne@amazon.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: don't report EOPNOTSUPP errors from discardJan Kara
When ext4 is mounted without journal, with discard mount option, and on a device not supporting trim, we print error for each and every freed extent. This is not only useless but actively harmful. Instead ignore the EOPNOTSUPP error. Trim is only advisory anyway and when the filesystem has journal we silently ignore trim error as well. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240213101601.17463-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-03-07ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()Jan Kara
ext4_xattr_block_set() drops ea_inode reference in two places. Handling it just under the 'cleanup' label is enough so drop the second occurence. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240209112107.10585-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>