summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-03-24bcachefs: zero init journal biosKent Overstreet
fix a kmsan splat Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Eliminate padding in move_bucket_keyKent Overstreet
We appear to be tripping over a compiler/kmsan bug with padding fields - this is an easy workaround. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Fix a KMSAN splat in btree_update_nodes_written()Kent Overstreet
We may sometimes read from uninitialized memory; we know, and that's ok. We check if a btree node has been reused before waiting on any outstanding IO. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: kmsan assertsKent Overstreet
Catching these early makes them a lot easier to track down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()Kent Overstreet
We store to all fields, so the kmsan warnings were spurious - but initializing via stores to bitfields appear to have been giving the compiler/kmsan trouble, and they're not necessary. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Disable asm memcpys when kmsan enabledKent Overstreet
kmsan doesn't know about inline assembly, obviously; this will close a ton of syzbot bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Handle backpointers with unknown data typesKent Overstreet
New data types might be added later, so we don't want to disallow unknown data types - that'll be a compatibility hassle later. Instead, ignore them. Reported-by: syzbot+3a290f5ff67ca3023834@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Count BCH_DATA_parity backpointers correctlyKent Overstreet
These are counted as stripe data in the corresponding alloc keys. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Run bch2_check_dirent_target() at lookup timeKent Overstreet
More on the "full online self healing" project: We now run most of the dirent <-> inode consistency checks, with repair code, at runtime - the exact same check and repair code that fsck runs. This will allow us to repair the "dirent points to inode that does not point back" inconsistencies that have been popping up at runtime. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Refactor bch2_check_dirent_target()Kent Overstreet
Prep work for calling bch2_check_dirent_target() from bch2_lookup(). - Add an inline wrapper, if the target and backpointer match we can skip the function call. - We don't (yet?) want to remove the dirent we did the lookup from (when we find a directory or subvol with multiple valid dirents pointing to it), we can defer on that until later. For now, add an "are we in fsck?" parameter. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Move bch2_check_dirent_target() to namei.cKent Overstreet
We're gradually running more and more fsck.c checks at runtime, whereever applicable; when we do so they get moved out of fsck.c. Next patch will call bch2_check_dirent_target() from bch2_lookup(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: fs-common.c -> namei.cKent Overstreet
name <-> inode, code for managing the relationships between inodes and dirents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: EIO cleanupKent Overstreet
Replace these with proper private error codes, so that when we get an error message we're not sifting through the entire codebase to see where it came from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: bch2_write_prep_encoded_data() now returns errcodeKent Overstreet
Prep work for killing off EIO and replacing them with proper private error codes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Simplify bch2_write_op_error()Kent Overstreet
There's no reason for the caller to do the actual logging, it's all done the same. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Fix block/btree node size defaultsKent Overstreet
We're fixing option parsing in userspace, it now obeys OPT_SB_FIELD_SECTORS Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Add missing smp_rmb()Alan Huang
The smp_rmb() guarantees that reads from reservations.counter occur before accessing cur_entry_u64s. It's paired with the atomic64_try_cmpxchg in journal_entry_open. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Kill JOURNAL_ERRORS()Kent Overstreet
Convert these to standard error codes, which means we can pass them outside the journal code, they're easier to pass to tracepoints, etc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Filesystem discard option now propagates to devicesKent Overstreet
the discard option is special, because it's both a filesystem and a device option. When set at the filesytsem level, it's supposed to propagate to (if set persistently via sysfs) or override (if non persistently as a mount option) the devices - that now works correctly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Device state is now a runtime optionKent Overstreet
Other options can normally be set at runtime via sysfs, no reason for this one not to be as well - it just doesn't support the degraded flags argument this way, that requires the ioctl. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Setting foreground_target at runtime now triggers rebalanceKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Device options now use standard sysfs codeKent Overstreet
Device options now use the common code for sysfs, and can superblock fields (in a struct bch_member). This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Kill BCH_DEV_OPT_SETTERS()Kent Overstreet
Previously, device options had their superblock option field listed separately, which was weird and easy to miss when defining options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Remove spurious smp_mb()Alan Huang
The smp_mb() is paired with nothing. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Fix incorrect state countAlan Huang
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is the condition in journal_entry_open where we return JOURNAL_ERR_max_open, so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf corresponding to seq has started to write. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Fix btree iter flags in data moveKent Overstreet
Rebalance requires a not_extents iterator. This wasn't hit before because all_snapshots disableds is_extents on snapshots btrees - but has no effect on the reflink btree. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Validate bch_sb.offset fieldKent Overstreet
This was missed - but it needs to be correct for the superblock recovery tool that scans the start and end of the device for backup superblocks: we don't want to pick up superblocks that belong to a different partition that starts at a different offset. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: bch2_sb_validate() doesn't need bch_sb_handleKent Overstreet
Minor refactoring, so that bch2_sb_validate() can be used in the new userspace superblock recovery tool. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Add missing random.h includesKent Overstreet
Fix build in userspace, and good hygeine. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Better incompat version/feature error messagesKent Overstreet
If we can't mount because of an incompatibility, print what's supported and unsupported - to help solve PEBKAC issues. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Fix offset_into_extent in data move pathKent Overstreet
Fixes the following: [ 17.607394] kernel BUG at fs/bcachefs/reflink.c:261! [ 17.608316] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 17.608485] CPU: 0 UID: 0 PID: 564 Comm: bch-rebalance/3 Tainted: G OE 6.14.0-rc6-arch1-gfcb0bd9609d2 #7 0efd7a8f4a00afeb2c5fb6e7ecb1aec8ddcbb1e1 [ 17.608616] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 17.608736] Hardware name: Micro-Star International Co., Ltd. MS-7D75/MAG B650 TOMAHAWK WIFI (MS-7D75), BIOS 1.74 08/01/2023 [ 17.608855] RIP: 0010:bch2_lookup_indirect_extent+0x252/0x290 [bcachefs] [ 17.609006] Code: 00 00 00 00 e8 7f 51 f5 ff 89 c3 85 c0 74 52 48 8b 7d b0 4c 89 ee e8 4d 4b f4 ff 48 63 d3 48 89 d0 31 d2 e9 2e ff ff ff 0f 0b <0f> 0b 48 8b 7d b0 4c 89 ee 48 89 55 a8 e8 2c 4b f4 ff 4c 8b 55 a8 [ 17.609136] RSP: 0018:ffffa3714455f850 EFLAGS: 00010246 [ 17.609261] RAX: 0000000000000080 RBX: ffff895891098790 RCX: 0000000000000000 [ 17.609387] RDX: 0000000000000080 RSI: ffffa3714455fa90 RDI: ffff895889550000 [ 17.609511] RBP: ffffa3714455f8c0 R08: ffff895891098790 R09: 0000000000000001 [ 17.609637] R10: ffffa3714455f8d8 R11: ffffa3714455f950 R12: ffffa3714455fa58 [ 17.609763] R13: ffff895891098790 R14: ffffa3714455fa58 R15: ffff895889550000 [ 17.609888] FS: 0000000000000000(0000) GS:ffff896757c00000(0000) knlGS:0000000000000000 [ 17.610015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.610143] CR2: 0000716b8cda2750 CR3: 0000000914e22000 CR4: 0000000000f50ef0 [ 17.610272] PKRU: 55555554 [ 17.610403] Call Trace: [ 17.610535] <TASK> [ 17.610662] ? __die_body.cold+0x19/0x27 [ 17.610791] ? die+0x2e/0x50 [ 17.610918] ? do_trap+0xca/0x110 [ 17.611049] ? do_error_trap+0x6a/0x90 [ 17.611178] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.611331] ? exc_invalid_op+0x50/0x70 [ 17.611468] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.611620] ? asm_exc_invalid_op+0x1a/0x20 [ 17.611757] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.611911] ? bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612084] bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612256] ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612431] ? bch2_move_extent+0x3d7/0x6e0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612607] ? __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612782] __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612959] ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.613149] do_rebalance+0x517/0x8d0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.613342] ? local_clock_noinstr+0xd/0xd0 [ 17.613518] ? local_clock+0x15/0x30 [ 17.613693] ? __bch2_trans_get+0x152/0x300 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.613890] ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.614090] bch2_rebalance_thread+0x66/0xb0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] The offset_into_extent bit was copied from the read path, but it's unnecessary here, where we always want to read and move the entire indirect extent, and it causes the assertion pop - because we're using a non-extents iterator, which always points to the end of the reflink pointer. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: use sha256() instead of crypto_shash APIEric Biggers
Just use sha256() instead of the clunky crypto API. This is much simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Remove unnecessary softdeps on crc32c and crc64Eric Biggers
Since bcachefs does not access crc32c and crc64 through the crypto API, there is no need to use module softdeps to ensure they are loaded. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: #if 0 out (enable|disable)_encryption()Kent Overstreet
These weren't hooked up, but they probably should be - add some comments for context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Improve can_write_extent()Kent Overstreet
This fixes another "rebalance spinning and doing no work" issue; rebalance was reading extents it wanted to move, but then failing in bch2_write() -> bch2_alloc_sectors_start() due to being unable to allocate sufficient replicas. This was triggered by a user playing with the durability settings, the foreground device was an NVME device with durability=2, and originally he'd set the background device to durability=2 as well, but changed it back to 1 (the default) after seeing IO errors. That meant that with replicas=2, we want to move data off the NVME device which satisfies that constraint, but with a single durability=1 device on the background target there's no way to move the extent to that target while satisfiying the "required replicas" constraint. The solution for now is for bch2_data_update_init() to check for this, and return an error - before kicking off the read. bch2_data_update_init() already had two different checks for "will we be able to write this extent", with partially duplicated code, so this patch combines and improves that logic. Additionally, we now always bail out and return an error if there's insufficient space on the destination target. Previously, we only did this for BCH_WRITE_alloc_nowait moves, because it might be the case that copygc just needs to free up space on the destination target. But we really shouldn't kick off a move if the destination is full, we can't currently distinguish between "really full" and "just need to wait for copygc", and if we are going to wait on copygc it'd be better to do that before kicking off the move. This will additionally fix "rebalance spinning" issues caused by a filesystem that has more data than can fit in background_target - which is a valid scenario, since we don't exclude foreground/cache devices when calculating filesystem capacity. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: trace_io_move_write_failKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Increase blacklist rangeAlan Huang
Now there are 16 journal buffers, 8 is too small to be enough. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: __bch2_read() now takes a btree_transKent Overstreet
Next patch will be checking if the extent we're reading from matches the IO failure we saw before marking the failure. For this to work, __bch2_read() needs to take the same transaction context that bch2_rbio_retry() uses to do that check. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: BCH_READ_data_update -> bch_read_bio.data_updateKent Overstreet
Read flags are codepath dependent and change as they're passed around, while the fields in rbio._state are mostly fixed properties of that particular object. Losing track of BCH_READ_data_update would be bad, and previously it was not obvious if it was always correctly set in the rbio, so this is a safety cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-21fs/procfs: fix the comment above proc_pid_wchan()Bart Van Assche
proc_pid_wchan() used to report kernel addresses to user space but that is no longer the case today. Bring the comment above proc_pid_wchan() in sync with the implementation. Link: https://lkml.kernel.org/r/20250319210222.1518771-1-bvanassche@acm.org Fixes: b2f73922d119 ("fs/proc, core/debug: Don't expose absolute kernel addresses via wchan") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Kees Cook <kees@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21fs/dax: don't disassociate zero page entriesAlistair Popple
Prior to commit 38607c62b34b ("fs/dax: properly refcount fs dax pages") dax_associate_entry() and dax_disassociate_entry() would implicitly skip zero and empty dax entries using the for_each_mapped_pfn() macro. The use of compound ZONE_DEVICE folios removed the need for this macro and so it was removed, leading dax_folio_put() to be called on zero pages. This lead to the below warning. To fix this explicitly skip zero and empty entries in dax_associate/disassociate_entry(). [ 27.536963] ------------[ cut here ]------------ [ 27.537674] WARNING: CPU: 11 PID: 874 at fs/dax.c:415 dax_folio_put.isra.0+0x10d/0x170 [ 27.538844] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm [ 27.539732] CPU: 11 UID: 0 PID: 874 Comm: ctl_prefault Tainted: G W 6.14.0-rc2+ #1104 [ 27.541093] Tainted: [W]=WARN [ 27.541549] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 27.543197] RIP: 0010:dax_folio_put.isra.0+0x10d/0x170 [ 27.543970] Code: 20 48 85 c0 0f 84 29 ff ff ff 48 83 e8 01 48 89 47 20 0f 84 1b ff ff ff 48 83 c4 10 5b 5d 41 5c c3 cc cc4 [ 27.546723] RSP: 0000:ffff961e4102fae0 EFLAGS: 00010002 [ 27.547505] RAX: 0000000000000001 RBX: ffffc9cce4e18000 RCX: 0000000000000009 [ 27.548564] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8a2a7badca40 [ 27.549630] RBP: ffffc9cce4e18000 R08: 0000000000009ffb R09: 00000000ffffdfff [ 27.550691] R10: 00000000ffffdfff R11: ffffffffa4e823a0 R12: 0000000000000000 [ 27.551748] R13: 0000000000000000 R14: 0000000010f10005 R15: 0000000000000004 [ 27.552819] FS: 00007f5f539d74c0(0000) GS:ffff8a2a7bac0000(0000) knlGS:0000000000000000 [ 27.554015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 27.554873] CR2: 00007f5f52e00000 CR3: 0000000909340000 CR4: 00000000000006f0 [ 27.555938] Call Trace: [ 27.556318] <TASK> [ 27.556650] ? __warn+0x91/0x190 [ 27.557146] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.557824] ? report_bug+0x164/0x190 [ 27.558378] ? handle_bug+0x54/0x90 [ 27.558898] ? exc_invalid_op+0x17/0x70 [ 27.559489] ? asm_exc_invalid_op+0x1a/0x20 [ 27.560125] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.560808] dax_insert_entry+0x1e1/0x420 [ 27.561419] dax_fault_iter+0x252/0x860 [ 27.561995] dax_iomap_pmd_fault+0x23c/0x4a0 [ 27.562651] ext4_dax_huge_fault+0x1e2/0x450 [ 27.563296] __handle_mm_fault+0x6c8/0x12b0 [ 27.563920] ? do_user_addr_fault+0x1ca/0x670 [ 27.564577] ? lock_vma_under_rcu+0x178/0x3b0 [ 27.565235] handle_mm_fault+0xe5/0x290 [ 27.565816] do_user_addr_fault+0x208/0x670 [ 27.566446] exc_page_fault+0x6d/0x230 [ 27.567008] asm_exc_page_fault+0x26/0x30 [ 27.567610] RIP: 0033:0x7f5f543bcb4f [ 27.568152] Code: 45 f0 48 8b 45 f0 48 8b 4d f8 48 03 41 18 48 89 45 e8 48 8b 45 f0 48 3b 45 e8 0f 83 97 00 00 00 48 8b 458 [ 27.570895] RSP: 002b:00007ffc2d774460 EFLAGS: 00010287 [ 27.571672] RAX: 00007f5f52e00000 RBX: 0000000000200000 RCX: 000055760153fc00 [ 27.572731] RDX: 0000000000000000 RSI: 0000557601542a20 RDI: 000055760153fc00 [ 27.573787] RBP: 00007ffc2d774460 R08: 0000000000000000 R09: 0000000000000073 [ 27.574840] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffc2d77534b [ 27.575897] R13: 00007ffc2d774aa0 R14: 0000000000800000 R15: 0000000000800000 [ 27.576961] </TASK> [ 27.577301] irq event stamp: 13394 [ 27.577810] hardirqs last enabled at (13393): [<ffffffffa3485780>] flush_tlb_mm_range+0x1c0/0x220 [ 27.579138] hardirqs last disabled at (13394): [<ffffffffa450d0c7>] _raw_spin_lock_irq+0x47/0x50 [ 27.580428] softirqs last enabled at (12530): [<ffffffffa433941a>] xs_tcp_send_request+0x22a/0x2e0 [ 27.581762] softirqs last disabled at (12528): [<ffffffffa40a60fd>] release_sock+0x1d/0xb0 [ 27.582986] ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250319013301.369822-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503102229.122fbd6c-lkp@intel.com Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21meminfo: add a per node counter for balloon driversNico Pache
Patch series "track memory used by balloon drivers", v2. This series introduces a way to track memory used by balloon drivers. Add a NR_BALLOON_PAGES counter to track how many pages are reclaimed by the balloon drivers. First add the accounting, then updates the balloon drivers (virtio, Hyper-V, VMware, Pseries-cmm, and Xen) to maintain this counter. The virtio, Vmware, and pseries-cmm balloon drivers utilize the balloon_compaction interface to allocate and free balloon pages. Other balloon drivers will have to maintain this counter manually. This makes the information visible in memory reporting interfaces like /proc/meminfo, show_mem, and OOM reporting. This provides admins visibility into their VM balloon sizes without requiring different virtualization tooling. Furthermore, this information is helpful when debugging an OOM inside a VM. This patch (of 4): Add NR_BALLOON_PAGES counter to track memory used by balloon drivers and expose it through /proc/meminfo and other memory reporting interfaces. [npache@redhat.com: document Balloon Meminfo entry] Link: https://lkml.kernel.org/r/a0315ccf-f244-460e-8643-fd7388724fe5@redhat.com Link: https://lkml.kernel.org/r/20250314213757.244258-1-npache@redhat.com Link: https://lkml.kernel.org/r/20250314213757.244258-2-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juegren Gross <jgross@suse.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21pNFS/flexfiles: Report ENETDOWN as a connection errorTrond Myklebust
If the client should see an ENETDOWN when trying to connect to the data server, it might still be able to talk to the metadata server through another NIC. If so, report the error. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containersTrond Myklebust
Propagate the NFS_MOUNT_NETUNREACH_FATAL flag to work with the pNFS flexfiles client. In these circumstances, the client needs to treat the ENETDOWN and ENETUNREACH errors as fatal, and should abandon the attempted I/O. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21NFS: Treat ENETUNREACH errors as fatal in containersTrond Myklebust
Propagate the NFS_MOUNT_NETUNREACH_FATAL flag to work with the generic NFS client. If the flag is set, the client will receive ENETDOWN and ENETUNREACH errors from the RPC layer, and is expected to treat them as being fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21NFS: Add a mount option to make ENETUNREACH errors fatalTrond Myklebust
If the NFS client was initially created in a container, and that container is torn down, there is usually no possibity to go back and destroy any NFS clients that are hung because their virtual network devices have been unlinked. Add a flag that tells the NFS client that in these circumstances, it should treat ENETDOWN and ENETUNREACH errors as fatal to the NFS client. The option defaults to being on when the mount happens from inside a net namespace that is not "init_net". Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21NFS: Add implid to sysfsAnna Schumaker
The Linux NFS server added support for returning this information during an EXCHANGE_ID in Linux v6.13. This is something and admin might want to query, so let's add it to sysfs. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Link: https://lore.kernel.org/r/20250207204225.594002-2-anna@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21NFS: Extend rdirplus mount option with "force|none"Benjamin Coddington
There are certain users that wish to force the NFS client to choose READDIRPLUS over READDIR for a particular mount. Update the "rdirplus" mount option to optionally accept values. For "rdirplus=force", the NFS client will always attempt to use READDDIRPLUS. The setting of "rdirplus=none" is aliased to the existing "nordirplus". Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/r/c4cf0de4c8be0930b91bc74bee310d289781cd3b.1741885071.git.bcodding@redhat.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21ubifs: Pass folios to acompHerbert Xu
As the acomp interface supports folios, use that instead of mapping the data in ubifs. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com> # For xfstests Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-21ubifs: Use crypto_acomp interfaceHerbert Xu
Replace the legacy crypto compression interface with the new acomp interface. Remove the compression mutexes and the overallocation for memory (the offender LZO has been fixed). Cap the output buffer length for compression to eliminate the post-compression check for UBIFS_MIN_COMPRESS_DIFF. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com> # For xfstests Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>