summaryrefslogtreecommitdiff
path: root/fs/bcachefs
AgeCommit message (Collapse)Author
2025-06-17bcachefs: Fix restart handling in btree_node_scrub_work()Kent Overstreet
btree node scrub was sometimes failing to rewrite nodes with errors; bch2_btree_node_rewrite() can return a transaction restart and we weren't checking - the lockrestart_do() needs to wrap the entire operation. And there's a better helper it should've been using, bch2_btree_node_rewrite_key(), which makes all this more convenient. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix bch2_read_bio_to_text()Kent Overstreet
We can only pass negative error codes to bch2_err_str(); if it's a positive integer it's not an error and we trip an assert. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Fix check_path_loop() + snapshotsKent Overstreet
A path exists in a particular snapshot: we should do the pathwalk in the snapshot ID of the inode we started from, _not_ change snapshot ID as we walk inodes and dirents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: check_subdir_count logs pathKent Overstreet
We can easily go from inode number -> path now, which makes for more useful log messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: additional diagnostics for reattach_inode()Kent Overstreet
Log the inode's new path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: check_directory_structure runs in reverse orderKent Overstreet
When we find a directory connectivity problem, we should do the repair in the oldest snapshot that has the issue - so that we don't end up duplicating work or making a real mess of things. Oldest snapshot IDs have the highest integer value, so - just walk inodes in reverse order. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Fix reattach_inode() for subvol rootsKent Overstreet
bch_subvolume.fs_path_parent needs to be updated as well, it should match inode.bi_parent_subvol. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Fix remove_backpointer() for subvol rootsKent Overstreet
The dirent will be in a different snapshot if the inode is a subvolume root. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Print path when we find a subvol loopKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix __bch2_inum_to_path() when crossing subvol boundariesKent Overstreet
The bch2_subvolume_get_snapshot() call needs to happen before the dirent lookup - the dirent is in the parent subvolume. Also, check for loops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Call bch2_fs_init_rw() early if we'll be going rwKent Overstreet
kthread creation checks for pending signals, which is _very_ annoying if we have to do a long recovery and don't go rw until we've done significant work. Check if we'll be going rw and pre-allocate kthreads/workqueues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Improve check_key_has_inode()Kent Overstreet
Print out more info when we find a key (extent, dirent, xattr) for a missing inode - was there a good inode in an older snapshot, full(ish) list of keys for that missing inode, so we can make better decisions on how to repair. If it looks like it should've been deleted, autofix it. If we ever hit the non-autofix cases, we'll want to write more repair code (possibly reconstituting the inode). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: don't return fsck_fix for unfixable node errors in __btree_errBharadwaj Raju
After cd3cdb1ef706 ("Single err message for btree node reads"), all errors caused __btree_err to return -BCH_ERR_fsck_fix no matter what the actual error type was if the recovery pass was scanning for btree nodes. This lead to the code continuing despite things like bad node formats when they earlier would have caused a jump to fsck_err, because btree_err only jumps when the return from __btree_err does not match fsck_fix. Ultimately this lead to undefined behavior by attempting to unpack a key based on an invalid format. Make only errors of type -BCH_ERR_btree_node_read_err_fixable cause __btree_err to return -BCH_ERR_fsck_fix when scanning for btree nodes. Reported-by: syzbot+cfd994b9cdf00446fd54@syzkaller.appspotmail.com Fixes: cd3cdb1ef706 ("bcachefs: Single err message for btree node reads") Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix pool->alloc NULL pointer dereferenceAlan Huang
btree_interior_update_pool has not been initialized before the filesystem becomes read-write, thus mempool_alloc in bch2_btree_update_start will trigger pool->alloc NULL pointer dereference in mempool_alloc_noprof Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Move bset size check before csum checkAlan Huang
In syzbot's crash, the bset's u64s is larger than the btree node. Reported-by: syzbot+bfaeaa8e26281970158d@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: mark more errors autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Kill unused tracepointsKent Overstreet
Dead code cleanup. Link: https://lore.kernel.org/linux-bcachefs/20250612224059.39fddd07@batman.local.home/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: opts.journal_rewindKent Overstreet
Add a mount option for rewinding the journal, bringing the entire filesystem to where it was at a previous point in time. This is for extreme disaster recovery scenarios - it's not intended as an undelete operation. The option takes a journal sequence number; the desired sequence number can be determined with 'bcachefs list_journal' Caveats: - The 'journal_transaction_names' option must have been enabled (it's on by default). The option controls emitting of extra debug info in the journal, so we can see what individual transactions were doing; It also enables journalling of keys being overwritten, which is what we rely on here. - A full fsck run will be automatically triggered since alloc info will be inconsistent. Only leaf node updates to non-alloc btrees are rewound, since rewinding interior btree updates isn't possible or desirable. - We can't do anything about data that was deleted and overwritten. Lots of metadata updates after the point in time we're rewinding to shouldn't cause a problem, since we segragate data and metadata allocations (this is in order to make repair by btree node scan practical on larger filesystems; there's a small 64-bit per device bitmap in the superblock of device ranges with btree nodes, and we try to keep this small). However, having discards enabled will cause problems, since buckets are discarded as soon as they become empty (this is why we don't implement fstrim: we don't need it). Hopefully, this feature will be a one-off thing that's never used again: this was implemented for recovering from the "vfs i_nlink 0 -> subvol deletion" bug, and that bug was unusually disastrous and additional safeguards have since been implemented. But if it does turn out that we need this more in the future, I'll have to implement an option so that empty buckets aren't discarded immediately - lagging by perhaps 1% of device capacity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: fsck: fix extent past end of inode repairKent Overstreet
Fix the case where we're deleting in a different snapshot and need to emit a whiteout - that requires a regular BTREE_ITER_filter_snapshots iterator. Also, only delete the part of the extent that extents past i_size. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: fsck: fix add_inode()Kent Overstreet
the inode btree uses the offset field for the inum, not the inode field. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix snapshot_key_missing_inode_snapshot repairKent Overstreet
When the inode was a whiteout, we were inserting a new whiteout at the wrong (old) snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix "now allowing incompatible features" messageKent Overstreet
Check against version_incompat_allowed, not version_incompat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: pass last_seq into fs_journal_start()Kent Overstreet
Prep work for journal rewind, where the seq we're replaying from may be different than the last journal entry's last_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: better __bch2_snapshot_is_ancestor() assertKent Overstreet
Previously, we weren't checking the result of the skiplist walk, just the is_ancestor bitmap. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: btree_iter: fix updates, journal overlayKent Overstreet
We need to start searching from search_key - _not_ path->pos, which will point to the key we found in the btree Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix bch2_journal_keys_peek_prev_min()Kent Overstreet
this code is rarely invoked, so - we had a few bugs left from basing it off of bch2_journal_keys_peek_max()... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Delay calculation of trans->journal_u64sAlan Huang
When there is commit error that need split btree leaf, fsck might change the value of trans->journal_entries.u64s, when retry commit, the value of trans->journal_u64s would be incorrect, which will lead to trans->journal_res.u64s underflow, and then out of bounds write will occur: [ 464.496970][T11969] Call trace: [ 464.496973][T11969] show_stack+0x3c/0x88 (C) [ 464.496995][T11969] dump_stack_lvl+0xf8/0x178 [ 464.497014][T11969] dump_stack+0x20/0x30 [ 464.497031][T11969] __bch2_trans_log_str+0x344/0x350 [ 464.497048][T11969] bch2_trans_log_str+0x3c/0x60 [ 464.497065][T11969] __bch2_fsck_err+0x11bc/0x1390 [ 464.497083][T11969] bch2_check_discard_freespace_key+0xad4/0x10d0 [ 464.497100][T11969] bch2_bucket_alloc_freelist+0x99c/0x1130 [ 464.497117][T11969] bch2_bucket_alloc_trans+0x79c/0xcb8 [ 464.497133][T11969] bch2_bucket_alloc_set_trans+0x378/0xc20 [ 464.497151][T11969] __open_bucket_add_buckets+0x7fc/0x1c00 [ 464.497168][T11969] open_bucket_add_buckets+0x184/0x3a8 [ 464.497185][T11969] bch2_alloc_sectors_start_trans+0xa04/0x1da0 [ 464.497203][T11969] bch2_btree_reserve_get+0x6e0/0xef0 [ 464.497220][T11969] bch2_btree_update_start+0x1618/0x2600 [ 464.497239][T11969] bch2_btree_split_leaf+0xcc/0x730 [ 464.497258][T11969] bch2_trans_commit_error+0x22c/0xc30 [ 464.497276][T11969] __bch2_trans_commit+0x207c/0x4e30 [ 464.497292][T11969] bch2_journal_replay+0x9e0/0x1420 [ 464.497305][T11969] __bch2_run_recovery_passes+0x458/0xf98 [ 464.497318][T11969] bch2_run_recovery_passes+0x280/0x478 [ 464.497331][T11969] bch2_fs_recovery+0x24f0/0x3a28 [ 464.497344][T11969] bch2_fs_start+0xb80/0x1248 [ 464.497358][T11969] bch2_fs_get_tree+0xe94/0x1708 [ 464.497377][T11969] vfs_get_tree+0x84/0x2d0 Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Add missing EBUG_ONAlan Huang
Just like the EBUG_ON in bch2_journal_add_entry(). Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix alloc_req use after freeAlan Huang
Now the alloc_req is allocated from the bump allocator, if there is reallocation, the memory of alloc_req would be frees, fix by delaying the reallocation to transaction restart, it has to restart anyway. Reported-by: syzbot+2887a13a5c387e616a68@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Don't allocate new memory when mempool is exhaustedAlan Huang
Allocating new memory when mempool is exhausted is too complicated, just return ENOMEM is fine. memcpy is not needed, since there might be pointers point to the old memory, that's the bug. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: btree iter tracepointsKent Overstreet
We've been seeing some livelock-ish behavior in the index update part of the main write path, and while we've got low level btree path tracepoints, we've been lacking high level btree iterator tracepoints. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: trace_extent_trim_atomicKent Overstreet
Add a tracepoint for when we insert only part of an extent, due to too many overwrites. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Don't trace should_be_locked unless changingKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Ensure that snapshot creation propagates has_case_insensitiveKent Overstreet
We normally can't create a new directory with the case-insensitive option already set - except when we're creating a snapshot. And if casefolding is enabled filesystem wide, we should still set it even though not strictly required, for consistency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Print devices we're mounting on multi device filesystemsKent Overstreet
Previously, we only ever logged the filesystem UUID. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Don't trust sb->nr_devices in members_to_text()Kent Overstreet
We have to be able to print superblock sections even if they fail to validate (for debugging), so we have to calculate the number of entries from the field size. Reported-by: syzbot+5138f00559ffb3cb3610@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Fix version checks in validate_bset()Kent Overstreet
It seems btree node scan picked up a partially overwritten btree node, and corrected the "bset version older than sb version_min" error - resulting in an invalid superblock with a bad version_min field. Don't run this check at all when we're in btree node scan, and when we do run it, do something saner if the bset version is totally crazy. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: ioctl: avoid stack overflow warningArnd Bergmann
Multiple ioctl handlers individually use a lot of stack space, and clang chooses to inline them into the bch2_fs_ioctl() function, blowing through the warning limit: fs/bcachefs/chardev.c:655:6: error: stack frame size (1032) exceeds limit (1024) in 'bch2_fs_ioctl' [-Werror,-Wframe-larger-than] 655 | long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg) By marking the largest two of them as noinline_for_stack, no indidual code path ends up using this much, which avoids the warning and reduces the possible total stack usage in the ioctl handler. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Don't pass trans to fsck_err() in gc_accounting_doneKent Overstreet
fsck_err() can return a transaction restart if passed a transaction object - this has always been true when it has to drop locks to prompt for user input, but we're seeing this more now that we're logging the error being corrected in the journal. gc_accounting_done() doesn't call fsck_err() from an actual commit loop, and it doesn't need to be holding btree locks when it calls fsck_err(), so the easy fix here for the unhandled transaction restart is to just not pass it the transaction object. We'll miss out on the fancy new logging, but that's ok. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Fix leak in bch2_fs_recovery() error pathKent Overstreet
Fix a small leak of the superblock 'clean' section. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Fix rcu_pending for PREEMPT_RTKent Overstreet
PREEMPT_RT redefines how standard spinlocks work, so local_irq_save() + spin_lock() is no longer equivalent to spin_lock_irqsave(). Fortunately, we don't strictly need to do it that way. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Fix downgrade_table_extra()Kent Overstreet
Fix a UAF: we were calling darray_make_room() and retaining a pointer to the old buffer. And fix an UBSAN warning: struct bch_sb_field_downgrade_entry uses __counted_by, so set dst->nr_errors before assigning to the array entry. Reported-by: syzbot+14c52d86ddbd89bea13e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Don't put rhashtable on stackKent Overstreet
Object debugging generally needs special provisions for putting said objects on the stack, which rhashtable does not have. Reported-by: syzbot+bcc38a9556d0324c2ec2@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Make sure opts.read_only gets propagated back to VFSKent Overstreet
If we think we're read-only but the VFS doesn't, fun will ensue. And now that we know we have to be able to do this safely, just make nochanges imply ro. Reported-by: syzbot+a7d6ceaba099cc21dee4@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Fix possible console lock involved deadlockAlan Huang
Link: https://lore.kernel.org/all/6822ab02.050a0220.f2294.00cb.GAE@google.com/T/ Reported-by: syzbot+2c3ef91c9523c3d1a25c@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: mark more errors autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Don't persistently run scan_for_btree_nodesKent Overstreet
bch2_btree_lost_data() gets called on btree node read error, but the error might be transient. btree_node_scan is expensive, and there's no need to run it persistently (marking it in the superblock as required to run) - check_topology will run it if required, via bch2_get_scanned_nodes(). Running it non-persistently is fine, to avoid check_topology having to rewind recovery to run it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Read error message now prints if self healingKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Only run 'increase_depth' for keys from btree node csanKent Overstreet
bch2_btree_increase_depth() was originally for disaster recovery, to get some data back from the journal when a btree root was bad. We don't need it for that purpose anymore; on bad btree root we'll launch btree node scan and reconstruct all the interior nodes. If there's a key in the journal for a depth that doesn't exists, and it's not from check_topology/btree node scan, we should just ignore it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11bcachefs: Mark need_discard_freespace_key_bad autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>