summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-04-03bcachefs: Fix check_snapshot_exists() restart handlingKent Overstreet
Codepaths that create entries in the snapshots btree currently call bch2_mark_snapshot(), which updates the in-memory snapshot table, before transaction commit. This is because bch2_mark_snapshot() is an atomic trigger, run with btree write locks held, and isn't allowed to fail - but it might need to reallocate the table, hence we call it early when we're still allowed to fail. This is generally harmless - if we fail, we'll have left an entry in the snapshots table around, but nothing will reference it and it'll get overwritten if reused by another transaction. But check_snapshot_exists(), which reconstructs snapshots when the snapshots btree has been corrupted or lost, was erronously rechecking if the snapshot exists inside the transaction commit loop - so on transaction restart (in this case mem_realloced), the second iteration would return without repairing. This code needs some cleanup: splitting out a "maybe realloc snapshots table" helper would have avoided this, that will be in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: use nonblocking variant of print_string_as_lines in error pathBharadwaj Raju
The inconsistency error path calls print_string_as_lines, which calls console_lock, which is a potentially-sleeping function and so can't be called in an atomic context. Replace calls to it with the nonblocking variant which is safe to call. Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: Fix scheduling while atomic from logging changesKent Overstreet
Two fixes from the recent logging changes: bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt context, or with rcu_read_lock() held. The one syzbot found is in bch2_bkey_pick_read_device bch2_dev_rcu bch2_fs_inconsistent We're starting to switch to lift the printbufs up to higher levels so we can emit better log messages and print them all in one go (avoid garbling), so that conversion will help with spotting these in the future; when we declare a printbuf it must be flagged if we're in an atomic context. Secondly, in btree_node_write_endio: 00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321 00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6 00085 preempt_count: 10001, expected: 0 00085 RCU nest depth: 0, expected: 0 00085 4 locks held by bch-reclaim/fa6/618: 00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198 00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440 00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440 00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130 00085 irq event stamp: 328 00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0 00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60 00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118 00085 softirqs last disabled at (0): [<0000000000000000>] 0x0 00085 Preemption disabled at: 00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90 00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798 00085 Hardware name: linux,dummy-virt (DT) 00085 Call trace: 00085 show_stack+0x1c/0x30 (C) 00085 dump_stack_lvl+0x84/0xc0 00085 dump_stack+0x14/0x20 00085 __might_resched+0x180/0x288 00085 __might_sleep+0x4c/0x88 00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0 00085 krealloc_noprof+0x1a0/0x2d8 00085 bch2_printbuf_make_room+0x9c/0x120 00085 bch2_prt_printf+0x60/0x1b8 00085 btree_node_write_endio+0x1b0/0x2d8 00085 bio_endio+0x138/0x1f0 00085 btree_node_write_endio+0xe8/0x2d8 00085 bio_endio+0x138/0x1f0 00085 blk_update_request+0x220/0x4c0 00085 blk_mq_end_request+0x28/0x148 00085 virtblk_request_done+0x64/0xe8 00085 blk_mq_complete_request+0x34/0x40 00085 virtblk_done+0x78/0x130 00085 vring_interrupt+0x6c/0xb0 00085 __handle_irq_event_percpu+0x8c/0x2e0 00085 handle_irq_event+0x50/0xb0 00085 handle_fasteoi_irq+0xc4/0x250 00085 handle_irq_desc+0x44/0x60 00085 generic_handle_domain_irq+0x20/0x30 00085 gic_handle_irq+0x54/0xc8 00085 call_on_irq_stack+0x24/0x40 Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: Add error handling for zlib_deflateInit2()Wentao Liang
In attempt_compress(), the return value of zlib_deflateInit2() needs to be checked. A proper implementation can be found in pstore_compress(). Add an error check and return 0 immediately if the initialzation fails. Fixes: 986e9842fb68 ("bcachefs: Compression levels") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: add missing selection of XARRAY_MULTIEric Biggers
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert a large folio in the page cache. Fix this by making bcachefs select XARRAY_MULTI. Fixes: be212d86b19c ("bcachefs: bs > ps support") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: bch_dev_usage_fullKent Overstreet
All the fastpaths that need device usage don't need the sector totals or fragmentation, just bucket counts. Split bch_dev_usage up into two different versions, the normal one with just bucket counts. This is also a stack usage improvement, since we have a bch_dev_usage on the stack in the allocation path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: Kill btree_iter.transKent Overstreet
This was planned to be done ages ago, now finally completed; there are places where we have quite a few btree_trans objects on the stack, so this reduces stack usage somewhat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: do_trace_key_cache_fill()Kent Overstreet
Reducing stack frame usage; this moves the printbuf out of the main stack frame. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: Split up bch_dev.io_refKent Overstreet
We now have separate per device io_refs for read and write access. This fixes a device removal bug where the discard workers were still running while we're removing alloc info for that device. It's also a bit of hardening; we no longer allow writes to devices that are read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-01bcachefs: fix ref leak in btree_node_read_all_replicasKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31bcachefs: Fix null ptr deref in bch2_write_endio()Kent Overstreet
This was previously hard to hit since it requires racing with device removal, but splitting up io_ref uncovered it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31bcachefs: Fix field spanning write warningKent Overstreet
Struct with embedded VLA... memcpy: detected field-spanning write (size 8) of single field "&gc->r.e" at fs/bcachefs/ec.c:465 (size 3) WARNING: CPU: 1 PID: 936 at fs/bcachefs/ec.c:465 bch2_trigger_stripe+0x706/0x730 Modules linked in: CPU: 1 UID: 0 PID: 936 Comm: mount.bcachefs Not tainted 6.14.0-rc6-ktest-00236-gefb0b5c62dbc #55 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:bch2_trigger_stripe+0x706/0x730 Code: b4 00 01 b9 03 00 00 00 48 89 fb 48 c7 c7 33 54 da 81 48 89 d6 49 89 d6 48 c7 c2 c3 36 db 81 e8 60 54 c5 ff 48 89 df 4c 89 f2 <0f> 0b e9 5c fd ff ff e8 fe 5e 4e 00 bf 10 00 00 00 48 c7 c6 ff ff RSP: 0018:ffff88817081f680 EFLAGS: 00010246 RAX: f8fe7dd1c56b5600 RBX: ffff888101265368 RCX: 0000000000000027 RDX: 0000000000000008 RSI: 00000000fffbffff RDI: ffff888101265368 RBP: 0000000000000000 R08: 000000000003ffff R09: ffff88817f1fe000 R10: 00000000000bfffd R11: 0000000000000004 R12: ffff8881012652c0 R13: 0000000000000000 R14: 0000000000000008 R15: ffff88817081f6c9 FS: 00007fc428bc7c80(0000) GS:ffff888179280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd3ee4a038 CR3: 000000010a9bc000 CR4: 0000000000750eb0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0xce/0x1b0 ? bch2_trigger_stripe+0x706/0x730 ? report_bug+0x11b/0x1a0 ? bch2_trigger_stripe+0x706/0x730 ? handle_bug+0x5e/0x90 ? exc_invalid_op+0x1a/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? bch2_trigger_stripe+0x706/0x730 bch2_gc_mark_key+0x2cf/0x430 bch2_check_allocations+0x1a64/0x1ed0 ? vsnprintf+0x1ad/0x420 ? bch2_check_allocations+0x191f/0x1ed0 bch2_run_recovery_passes+0x13b/0x2b0 bch2_fs_recovery+0x9b7/0x1290 ? __bch2_print+0xb2/0xf0 ? bch2_printbuf_exit+0x1e/0x30 ? print_mount_opts+0x153/0x180 bch2_fs_start+0x274/0x3b0 bch2_fs_get_tree+0x516/0x6e0 vfs_get_tree+0x21/0xa0 do_new_mount+0x153/0x350 __x64_sys_mount+0x16c/0x1f0 do_syscall_64+0x6c/0x140 ? arch_exit_to_user_mode_prepare+0x9/0x40 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31bcachefs: Fix striping behaviourKent Overstreet
For striping across devices, we maintain "clocks", and we advance them by the inverse of "how much free space this device has left", so that we round robin biased in favor of devices with more free space. This code was originally trying to do EWMA-ish stuff when originally written, ~10 years ago, and was never properly cleaned up when it was realized that an EWMA is not the right approach here. That left a bug, when we rescale to keep all the clocks in the correct range and prevent overflow. It was assumed that we'd always be allocated from the device with the smallest clock hand, but that's actually not correct: with the target options, allocations will be first tried from a subset of devices, and then the entire filesystem if that fails. Thus, the rescale from the first allocation - allocating from a subset of devices - can pick the wrong rescale value and cause the rest of the clocks to go to 0, losing information. This resuls in incorrect striping behaviour when the desired number of replicas doesn't fit on the foreground target. Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: fix bch2_write_point_to_text() unitsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: Log original key being moved in data updatesKent Overstreet
There's something going on with the data move path; log the original key being moved for debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: BCH_JSET_ENTRY_log_bkeyKent Overstreet
Add a journal entry type for logging - but logging a bkey, not a string; to be used for data move path debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: Reorder error messages that include journal debugKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: Don't use designated initializers for disk_accounting_posKent Overstreet
Not all compilers fully initialize these - they're not guaranteed to because of the union shenanigans. Fixes: https://github.com/koverstreet/bcachefs/issues/844 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: Silence errors after emergency shutdownKent Overstreet
We don't care about errors from asynchronous ops that were because we did an emergency shutdown; silence them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: fix units in rebalance_statusKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: bch2_ioctl_subvolume_destroy() fixesKent Overstreet
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly pruning the dcache. Also, fix missing permissions checks. Reported-by: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29bcachefs: Clear fs_path_parent on subvolume unlinkKent Overstreet
This fixes recursive subvolume removal. Subvolume deletion is asynchronous; fs_path_parent, and thus the entry in the subvolume_children btree, need to be cleared when the subvolume is unlinked from the fs heirarchy - else we'll spuriously think a subvolume has children and deletion will fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29bcachefs: Change btree_insert_node() assertion to errorKent Overstreet
Debug for https://github.com/koverstreet/bcachefs/issues/843 Print useful debug info and go emergency read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29bcachefs: Better printing of inconsistency errorsKent Overstreet
Build up and emit the error message for an inconsistency error all at once, instead of spread over multiple printk calls, so they're not jumbled in the dmesg log. Also, add better indenting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29bcachefs: bch2_count_fsck_err()Kent Overstreet
Factor out a helper from __bch2_fsck_err(), for counting the error in the superblock and deciding whether to print or ratelimit - will be used to replace some log_fsck_err() calls, where we want to lift out printing the error message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Better helpers for inconsistency errorsKent Overstreet
An inconsistency error often happens as part of an event with multiple error messages, and we want to build up one single error message with proper indenting to produce more readable log messages that don't get garbled. Add new helpers that emit messages to a printbuf instead of printing them directly, next patch will convert to use them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Consistent indentation of multiline fsck errorsKent Overstreet
Add the new helper printbuf_indent_add_nextline(), and use it in __bch2_fsck_err() to centralize setting the indentation of multiline fsck errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()Kent Overstreet
To be used by the mount helper in userspace, where we still have options to be parsed by other layers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: bch2_time_stats_init_no_pcpu()Kent Overstreet
Add a mode to disable automatic switching to percpu mode, useful when a time_stats will only be used by one thread and we don't want to have to flush the percpu buffers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Fix bch2_fs_get_tree() error pathFlorian Albrechtskirchinger
When a filesystem is mounted read-only, subsequent attempts to mount it as read-write fail with EBUSY. Previously, the error path in bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(), improperly freeing resources for a filesystem that was still actively mounted. This change modifies the error path to only call __bch2_fs_stop() if the superblock has no valid root dentry, ensuring resources are not cleaned up prematurely when the filesystem is in use. Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: fix logging in journal_entry_err_msg()Kent Overstreet
We want to log errors all at once, not spread across multiple printks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: add missing newline in bch2_trans_updates_to_text()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: print_string_as_lines: fix extra newlineKent Overstreet
Don't print a newline on empty string; this was causing us to also print an extra newline when we got to the end of th string. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Fix WARN() in bch2_bkey_pick_read_device()Kent Overstreet
syzbot discovered that this one is possible: we have pointers, but none of them are to valid devices. Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Don't return 0 size holes from bch2_seek_hole()Kent Overstreet
The hole we find in the btree might be fully dirty in the page cache. If so, keep searching. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Fix bch2_seek_hole() lockingKent Overstreet
We can't call bch2_seek_pagecache_hole(), and block on page locks, with btree locks held. This is easily fixed because we're at the end of the transaction - we can just unlock, we don't need a drop_locks_do(). Reported-by: https://github.com/nagalun Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Recovery no longer holds state_lockKent Overstreet
state_lock guards against devices coming or leaving, changing state, or the filesystem changing between ro <-> rw. But it's not necessary for running recovery passes, and holding it blocks asynchronous events that would cause us to go RO or kick out devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28bcachefs: Fix permissions on version modparamKent Overstreet
There's no reason for this not to be world readable - it provides the currently supported on disk format version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: cond_resched() in journal_key_sort_cmp()Kent Overstreet
Fixes "task out to lunch" warnings during recovery on large machines with lots of dirty data in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: Fix 'hung task' messages in btree node scanKent Overstreet
btree node scan has to wait on kthread workers that scan each device, potentially for awhile. We would like this to be interruptible, but we may need a different mechanism than signals for that - we've had bugs in the past where mounts were failing due to checking for signals, and no explanation on where they came from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: Fix btree iter flags in data move (2)Kent Overstreet
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts requires a not_extents iterator; this fixes the path where we're walking the extents btree and chase a reflink pointer into the reflink btree. bch2_lookup_indirect_extent() requires working with an extents iterator (due to peek_slot() semantics), so we implement bch2_lookup_indirect_extent_for_move(). This is simplified because there's no need to report indirect_extent_missing_errors here, that can be deferred until fsck or when a user reads that data. Reported-by: Maƫl Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: Don't unnecessarily decrypt data when movingKent Overstreet
There's various checks for "are we going to compress this" - but we're not going to compress if we know it's incompressible. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: Document disk accounting keys and conutersKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: Validate number of counters for accounting keysKent Overstreet
We weren't checking that accounting keys have the expected number of accounters. Originally we probably wanted to be flexible on this, but it doesn't look like that will be required - accounting is extended by adding new counter types, not more counters to an existing type. This means we can drop a BUG_ON() that popped once in automated testing, and the new validation will make that bug easier to track down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25bcachefs: Use print_string_as_lines() for journal stuck messagesKent Overstreet
They were being truncated, printk has a 1k limit per call Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25bcachefs: Fix duplicate checksum error messages in write pathKent Overstreet
Also, improve the message in prep_encoded_data() - it now prints good/bad checksums, and checksum type. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25bcachefs: Fix silent short reads in data read retry pathKent Overstreet
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size to "the size we can read from the current extent" with a swap, and restores it to "the size for the total read" after the read_extent call with another swap. But we neglected to do the restore before the "if (ret) goto err;" - which is a problem if we're retrying those errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25bcachefs: Fix nonce inconsistency in bch2_write_prep_encoded_data()Kent Overstreet
If we're moving an extent that was partially overwritten, bch2_write_rechecksum() will trim it to the currenty live range. If we then also want to compress it, it'll be decrypted - but the nonce has been advanced for the overwritten start of the extent that we dropped, and we were using the nonce we calculated before rechecksum(). Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Fixes: 127d90d2823e ("bcachefs: bch2_write_prep_encoded_data() now returns errcode") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Kill unnecessary bch2_dev_usage_read()Kent Overstreet
bch2_dev_usage_read() is fairly expensive, we should optimize this more. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: btree node write errors now print btree nodeKent Overstreet
It turned out a user was wondering why we were going read-only after a write error, and he didn't realize he didn't have replication enabled - this will make that more obvious, and we should be printing it anyways. Link: https://www.reddit.com/r/bcachefs/comments/1jf9akl/large_data_transfers_switched_bcachefs_to_readonly/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>