linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-07-01	bcachefs: opts.casefold_disabled	Kent Overstreet
	Add an option for completely disabling casefolding on a filesystem, as a workaround for overlayfs. This should only be needed as a temporary workaround, until the overlayfs fix arrives. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16	bcachefs: opts.journal_rewind	Kent Overstreet
	Add a mount option for rewinding the journal, bringing the entire filesystem to where it was at a previous point in time. This is for extreme disaster recovery scenarios - it's not intended as an undelete operation. The option takes a journal sequence number; the desired sequence number can be determined with 'bcachefs list_journal' Caveats: - The 'journal_transaction_names' option must have been enabled (it's on by default). The option controls emitting of extra debug info in the journal, so we can see what individual transactions were doing; It also enables journalling of keys being overwritten, which is what we rely on here. - A full fsck run will be automatically triggered since alloc info will be inconsistent. Only leaf node updates to non-alloc btrees are rewound, since rewinding interior btree updates isn't possible or desirable. - We can't do anything about data that was deleted and overwritten. Lots of metadata updates after the point in time we're rewinding to shouldn't cause a problem, since we segragate data and metadata allocations (this is in order to make repair by btree node scan practical on larger filesystems; there's a small 64-bit per device bitmap in the superblock of device ranges with btree nodes, and we try to keep this small). However, having discards enabled will cause problems, since buckets are discarded as soon as they become empty (this is why we don't implement fstrim: we don't need it). Hopefully, this feature will be a one-off thing that's never used again: this was implemented for recovering from the "vfs i_nlink 0 -> subvol deletion" bug, and that bug was unusually disastrous and additional safeguards have since been implemented. But if it does turn out that we need this more in the future, I'll have to implement an option so that empty buckets aren't discarded immediately - lagging by perhaps 1% of device capacity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21	bcachefs: Knob for manual snapshot deletion	Kent Overstreet
	Add 'opts.snapshot_deletion_enabled', enabled by default. This may be turned off so that the new sysfs knob, 'internal/trigger_delete_dead_snapshots', may be used instead - this will allow snapshot deletion to be profiled more easily. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21	bcachefs: opts.rebalance_on_ac_only	Kent Overstreet
	Add an option for setting rebalance to only run when connected to mains power. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21	bcachefs: Incompatible features may now be enabled at runtime	Kent Overstreet
	version_upgrade is now a runtime option. In the future we'll want to add compatible upgrades at runtime, and call the full check_version_upgrade() when the option changes, but we don't have compatible optional upgrades just yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21	bcachefs: Clean up option pre/post hooks, small fixes	Kent Overstreet
	The helpers are now: - bch2_opt_hook_pre_set() - bch2_opts_hooks_pre_set() - bch2_opt_hook_post_set Fix a bug where the filesystem discard option would incorrectly be changed when setting the device option, and don't trigger rebalance scans unnecessarily (when options aren't changing). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21	bcachefs: Single device mode	Kent Overstreet
	Single device filesystems are now identified by the block device name, not the UUID - and single device filesystems with the same UUID can be mounted simultaneously, without any special options. This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which indicates whether a filesystem has ever been multi device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21	bcachefs: Improve opts.degraded	Kent Overstreet
	Kill 'opts.very_degraded', and make 'opts.degraded' a persistent option, stored in the superblock. It's now an enum, with available choices ask/yes/very/no. "ask" mode will be handled by the mount helper, for prompting the user (on a machine used interactively) for whether to do a degraded mount. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24	bcachefs: Casefold is now a regular opts.h option	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28	bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()	Kent Overstreet
	To be used by the mount helper in userspace, where we still have options to be parsed by other layers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25	bcachefs: Fix duplicate checksum error messages in write path	Kent Overstreet
	Also, improve the message in prep_encoded_data() - it now prints good/bad checksums, and checksum type. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24	bcachefs: Fix block/btree node size defaults	Kent Overstreet
	We're fixing option parsing in userspace, it now obeys OPT_SB_FIELD_SECTORS Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24	bcachefs: Device state is now a runtime option	Kent Overstreet
	Other options can normally be set at runtime via sysfs, no reason for this one not to be as well - it just doesn't support the degraded flags argument this way, that requires the ioctl. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24	bcachefs: Device options now use standard sysfs code	Kent Overstreet
	Device options now use the common code for sysfs, and can superblock fields (in a struct bch_member). This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24	bcachefs: Kill BCH_DEV_OPT_SETTERS()	Kent Overstreet
	Previously, device options had their superblock option field listed separately, which was weird and easy to miss when defining options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16	bcachefs: Checksum errors get additional retries	Kent Overstreet
	It's possible for checksum errors to be transient - e.g. flakey controller or cable, thus we need additional retries (besides retrying from different replicas) before we can definitely return an error. This is particularly important for the next patch, which will allow the data move path to move extents with checksum errors - we don't want to accidentally introduce bitrot due to a transient error! - bch2_bkey_pick_read_device() is substantially reworked, and bch2_dev_io_failures is expanded to record more information about the type of failure (i.e. number of checksum errors). It now returns an error code that describes more precisely the reason for the failure - checksum error, io error, or offline device, instead of the previous generic "insufficient devices". This is important for the next patches that add poisoning, as we only want to poison extents when we've got real checksum errors (or perhaps IO errors?) - not because a device was offline. - Add a new option and superblock field for the number of checksum retries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14	bcachefs: Kick devices out after too many write IO errors	Kent Overstreet
	We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible). But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely. This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only. In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy". Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset. After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14	bcachefs: metadata_target is not an inode option	Kent Overstreet
	This option only applies filesystem wide. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06	bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on ↵	Kent Overstreet
	bch_extent_rebalance Previously, bch2_bkey_sectors_need_rebalance() called bch2_target_accepts_data(), checking whether the target is writable. However, this means that adding or removing devices from a target would change the value of bch2_bkey_sectors_need_rebalance() for an existing extent; this needs to be invariant so that the extent trigger can correctly maintain rebalance_work accounting. Instead, check target_accepts_data() in io_opts_to_rebalance_opts(), before creating the bch_extent_rebalance entry. This fixes (one?) cause of rebalance_work accounting being off. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25	bcachefs: rebalance, copygc enabled are runtime opts	Kent Overstreet
	Fix a regression from when these were switched to normal opts.h options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09	bcachefs: bcachefs_metadata_version_persistent_inode_cursors	Kent Overstreet
	Persistent cursors for inode allocation. A free inodes btree would add substantial overhead to inode allocation and freeing - a "next num to allocate" cursor is always going to be faster. We just need it to be persistent, to avoid scanning the inodes btree from the start on startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: Don't BUG_ON() when superblock feature wasn't set for compressed data	Kent Overstreet
	We don't allocate the mempools for compression/decompression unless we need them - but that means there's an inconsistency to check for. Reported-by: syzbot+cb3fbcfb417448cfd278@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: Correct the description of the '--bucket=size' options	Youling Tang
	Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: New bch_extent_rebalance fields	Kent Overstreet
	- Add more io path options to bch_extent_rebalance - For each option, track whether it came from the filesystem or the inode This will be used for improved rebalance support for reflinked data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: bch2_prt_csum_opt()	Kent Overstreet
	bounds checking helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: copygc_enabled, rebalance_enabled now opts.h options	Kent Overstreet
	They can now be set at mount time Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: Add bch_io_opts fields for indicating whether the opts came from ↵	Kent Overstreet
	the inode This is going to be used in the bch_extent_rebalance improvements, which propagate io_path options into the extent (important for rebalance, which needs something present in the extent for transactionally tagging them in the rebalance_work btree, and also for indirect extents). By tracking in bch_extent_rebalance whether the option came from the filesystem or the inode we can correctly handle options being changed on indirect extents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: io_opts_to_rebalance_opts()	Kent Overstreet
	New helper to simplify bch2_bkey_set_needs_rebalance() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21	bcachefs: bch2_io_opts_fixups()	Kent Overstreet
	Centralize some io path option fixups - they weren't always being applied correctly: - background_compression uses compression if unset - background_target uses foreground_target if unset - nocow disables most fancy io path options Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18	bcachefs: Add hash seed, type to inode_to_text()	Kent Overstreet
	This helped with discovering some filesystem corruption fsck has having trouble with: the str_hash type had gotten flipped on one snapshot's version of an inode. All versions of a given inode number have the same hash seed and hash type, since lookups will be done with a single hash/seed and type and see dirents/xattrs from multiple snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21	bcachefs: bch2_opts_to_text()	Kent Overstreet
	Factor out bch2_show_options() into a generic helper, for debugging option passing issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21	bcachefs: Options for recovery_passes, recovery_passes_exclude	Kent Overstreet
	This adds mount options for specifying recovery passes to run, or exclude; the immediate need for this is that backpointers fsck is having trouble completing, so we need a way to skip it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: promote_whole_extents is now a normal option	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: Opt_durability can now be set via bch2_opt_set_sb()	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: bch2_opt_set_sb() can now set (some) device options	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: data_allowed is now an opts.h option	Kent Overstreet
	need this so cmd_option in userspace can handle it Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07	bcachefs: Make allocator stuck timeout configurable, ratelimit messages	Kent Overstreet
	Limit these messages to once every 2 minutes to avoid spamming logs; with multiple devices the output can be quite significant. Also, up the default timeout to 30 seconds from 10 seconds. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: Make read_only a mount option again, but hidden	Kent Overstreet
	fsck passes read_only as a mount option, and it's required for nochanges, which it also uses. Usually read_only is handled by the VFS, but we need to be able to handle it too; we just don't want to print it out twice, so mark it as a hidden option. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: use new mount API	Thomas Bertschinger
	This updates bcachefs to use the new mount API: - Update the file_system_type to use the new init_fs_context() function. - Define the new fs_context_operations functions. - No longer register bch2_mount() and bch2_remount(); these are now called via the new fs_context functions. - Define a new helper type, bch2_opts_parse that includes a struct bch_opts and additionally a printbuf used to save options that can't be parsed until after the FS is opened. This enables us to parse as many options as possible prior to opening the filesystem while saving those options that need the open FS for later parsing. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: add printbuf arg to bch2_parse_mount_opts()	Thomas Bertschinger
	Mount options that take the name of a device that may be part of a filesystem, for example "metadata_target", cannot be validated until after the filesystem has been opened. However, an attempt to parse those options may be made prior to the filesystem being opened. This change adds a printbuf parameter to bch2_parse_mount_opts() which will be used to save those mount options, when they are supplied prior to the FS being opened, so that they can be parsed later. This functionality is not currently needed, but will be used after bcachefs starts using the new mount API to parse mount options. This is because using the new mount API, we will process mount options prior to opening the FS, but the new API doesn't provide a convenient way to "replay" mount option parsing. So we save these options ourselves to accomplish this. This change also splits out the code to parse a single option into bch2_parse_one_mount_opt(), which will be useful when using the new mount API which deals with a single mount option at a time. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: don't expose "read_only" as a mount option	Thomas Bertschinger
	When "read_only" is exposed as a mount option, it is redundant with the standard option "ro" and gives users multiple ways to specify that a bcachefs filesystem should be mounted read-only. This presents the risk of having inconsistent options specified. This can be seen when remounting a read-only filesystem in read-write mode, using mount(8) from util-linux. Because mount(8) parses the existing mount options from `/proc/mounts` and applies them when remounting, it can end up applying both "read_only" and "rw": $ mount img -o ro /mnt $ strace mount -o remount,rw /mnt ... fsconfig(4, FSCONFIG_SET_FLAG, "read_only", NULL, 0) = 0 fsconfig(4, FSCONFIG_SET_FLAG, "rw", NULL, 0) = 0 ... Making "read_only" no longer a mount option means this edge case cannot occur. Fixes: 62719cf33c3a ("bcachefs: Fix nochanges/read_only interaction") Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-20	bcachefs: Fix safe errors by default	Kent Overstreet
	i.e. the start of automatic self healing: If errors=continue or fix_safe, we now automatically fix simple errors without user intervention. New error action option: fix_safe This replaces the existing errors=ro option, which gets a new slot, i.e. existing errors=ro users now get errors=fix_safe. This is currently only enabled for a limited set of errors - initially just disk accounting; errors we would never not want to fix, and we don't want to require user intervention (i.e. to make sure a bug report gets filed). Errors will still be counted in the superblock, so we (developers) will still know they've been occuring if a bug report gets filed (as bug reports typically include the errors superblock section). Eventually we'll be enabling this for a much wider set of errors, after we've done thorough error injection testing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08	bcachefs: Kill opts.buckets_nouse	Kent Overstreet
	Now explicitly allocate and free the buckets_nouse bitmap - this is going to be used for online fsck. To go RW when we haven't check allocations, we'll do a much slimmed down version that just initializes the buckets_nouse bitmaps. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08	bcachefs: iter/update/trigger/str_hash flag cleanup	Kent Overstreet
	Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later. These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13	bcachefs: Standardize helpers for printing enum strs with bounds checks	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03	bcachefs: Repair pass for scanning for btree nodes	Kent Overstreet
	If a btree root or interior btree node goes bad, we're going to lose a lot of data, unless we can recover the nodes that it pointed to by scanning. Fortunately btree node headers are fully self describing, and additionally the magic number is xored with the filesytem UUID, so we can do so safely. This implements the scanning - next patch will rework topology repair to make use of the found nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31	bcachefs: Improve -o norecovery; opts.recovery_pass_limit	Kent Overstreet
	This adds opts.recovery_pass_limit, and redoes -o norecovery to make use of it; this fixes some issues with -o norecovery so it can be safely used for data recovery. Norecovery means "don't do journal replay"; it's an important data recovery tool when we're getting stuck in journal replay. When using it this way we need to make sure we don't free journal keys after startup, so we continue to overlay them: thus it needs to imply retain_recovery_info, as well as nochanges. recovery_pass_limit is an explicit option for telling recovery to exit after a specific recovery pass; this is a much cleaner way of implementing -o norecovery, as well as being a useful debug feature in its own right. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13	bcachefs: Pin btree cache in ram for random access in fsck	Kent Overstreet
	Various phases of fsck involve checking references from one btree to another: this means doing a sequential scan of one btree, and then mostly random access into the second. This is particularly painful for checking extents <-> backpointers; we can prefetch btree node access on the sequential scan, but not on the random access portion, and this is particularly painful on spinning rust, where we'd like to keep the pipeline fairly full of btree node reads so that the elevator can reduce seeking. This patch implements prefetching and pinning of the portion of the btree that we'll be doing random access to. We already calculate how much of the random access btree will fit in memory so it's a fairly straightforward change. This will put more pressure on system memory usage, so we introduce a new option, fsck_memory_usage_percent, which is the percentage of total system ram that fsck is allowed to pin. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10	bcachefs: no_splitbrain_check option	Kent Overstreet
	This adds an option to disable kicking out devices when splitbrain is detected - it seems there's some issues with splitbrain detection and we're kicking out devices erronously. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21	bcachefs: opts->compression can now also be applied in the background	Kent Overstreet
	The "apply this compression method in the background" paths now use the compression option if background_compression is not set; this means that setting or changing the compression option will cause existing data to be compressed accordingly in the background. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>