Age | Commit message (Collapse) | Author |
|
steal the (clever) algorithm from get_random_u32_below()
this fixes a bug where we were passing roundup_pow_of_two() a 64 bit
number - we're squaring device latencies now:
[ +1.681698] ------------[ cut here ]------------
[ +0.000010] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[ +0.000011] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[ +0.000011] CPU: 1 UID: 0 PID: 196 Comm: kworker/u32:13 Not tainted 6.14.0-rc6-dave+ #10
[ +0.000012] Hardware name: ASUS System Product Name/PRIME B460I-PLUS, BIOS 1301 07/13/2021
[ +0.000005] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[ +0.000354] Call Trace:
[ +0.000005] <TASK>
[ +0.000007] dump_stack_lvl+0x5d/0x80
[ +0.000018] ubsan_epilogue+0x5/0x30
[ +0.000008] __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe6
[ +0.000011] bch2_rand_range.cold+0x17/0x20 [bcachefs]
[ +0.000231] bch2_bkey_pick_read_device+0x547/0x920 [bcachefs]
[ +0.000229] __bch2_read_extent+0x1e4/0x18e0 [bcachefs]
[ +0.000241] ? bch2_btree_iter_peek_slot+0x3df/0x800 [bcachefs]
[ +0.000180] ? bch2_read_retry_nodecode+0x270/0x330 [bcachefs]
[ +0.000230] bch2_read_retry_nodecode+0x270/0x330 [bcachefs]
[ +0.000230] bch2_rbio_retry+0x1fa/0x600 [bcachefs]
[ +0.000224] ? bch2_printbuf_make_room+0x71/0xb0 [bcachefs]
[ +0.000243] ? bch2_read_csum_err+0x4a4/0x610 [bcachefs]
[ +0.000278] bch2_read_csum_err+0x4a4/0x610 [bcachefs]
[ +0.000227] ? __bch2_read_endio+0x58b/0x870 [bcachefs]
[ +0.000220] __bch2_read_endio+0x58b/0x870 [bcachefs]
[ +0.000268] ? try_to_wake_up+0x31c/0x7f0
[ +0.000011] ? process_one_work+0x176/0x330
[ +0.000008] process_one_work+0x176/0x330
[ +0.000008] worker_thread+0x252/0x390
[ +0.000008] ? __pfx_worker_thread+0x10/0x10
[ +0.000006] kthread+0xec/0x230
[ +0.000011] ? __pfx_kthread+0x10/0x10
[ +0.000009] ret_from_fork+0x31/0x50
[ +0.000009] ? __pfx_kthread+0x10/0x10
[ +0.000008] ret_from_fork_asm+0x1a/0x30
[ +0.000012] </TASK>
[ +0.000046] ---[ end trace ]---
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull misc vfs cleanups from Al Viro:
"Two unrelated patches - one is a removal of long-obsolete include in
overlayfs (it used to need fs/internal.h, but the extern it wanted has
been moved back to include/linux/namei.h) and another introduces
convenience helper constructing struct qstr by a NUL-terminated
string"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add a string-to-qstr constructor
fs/overlayfs/namei.c: get rid of include ../internal.h
|
|
Quite a few places want to build a struct qstr by given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().
The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.
Unlike initializer list it's a valid expression. What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid. It can also be used as initializer, with identical
effect -
struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
struct qstr anon_variable = {.name = s, .len = strlen(s)};
struct qstr x = anon_variable;
// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
struct qstr x = {.name = s, .len = strlen(s)};
What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().
This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Add a version of kvmalloc() that doesn't have the INT_MAX limit; large
filesystems do hit this.
We'll want to get rid of the in-memory bucket gens array, but we're not
there quite yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.
The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.
This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
need this so cmd_option in userspace can handle it
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- In the series "treewide: Refactor heap related implementation",
Kuan-Wei Chiu has significantly reworked the min_heap library code
and has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally
more rational.
- Kuan-Wei Chiu has sent along some maintenance work against our
sorting library code in the series "lib/sort: Optimizations and
cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
GDB command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
ia64: scrub ia64 from poison.h
watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
tsacct: replace strncpy() with strscpy()
lib/bch.c: use swap() to improve code
test_bpf: convert comma to semicolon
init/modpost: conditionally check section mismatch to __meminit*
init: remove unused __MEMINIT* macros
nilfs2: Constify struct kobj_type
nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
math: rational: add missing MODULE_DESCRIPTION() macro
lib/zlib: add missing MODULE_DESCRIPTION() macro
fs: ufs: add MODULE_DESCRIPTION()
lib/rbtree.c: fix the example typo
ocfs2: add bounds checking to ocfs2_check_dir_entry()
fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
coredump: simplify zap_process()
selftests/fpu: add missing MODULE_DESCRIPTION() macro
compiler.h: simplify data_race() macro
build-id: require program headers to be right after ELF header
resource: add missing MODULE_DESCRIPTION()
...
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all trigger in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Drop the heap-related macros from bcachefs and replacing them with the
generic min_heap implementation from include/linux. By doing so, code
readability is improved by using functions instead of macros. Moreover,
the min_heap implementation in include/linux adopts a bottom-up variation
compared to the textbook version currently used in bcachefs. This
bottom-up variation allows for approximately 50% reduction in the number
of comparison operations during heap siftdown, without changing the number
of swaps, thus making it more efficient.
[visitorckw@gmail.com: fix missing assignment of minimum element]
Link: https://lkml.kernel.org/r/20240602174828.1955320-1-visitorckw@gmail.com
Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu
Link: https://lkml.kernel.org/r/20240524152958.919343-17-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
bdev_sectors() is not used hence remove it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-10-viro@zeniv.linux.org.uk
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
"bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.
The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottleneked on write buffer flushing - we
don't want kicknig off btree updates to depend on the state of the write
buffer.
This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The ! was obviously intended to be ~. As it is, this function does
the equivalent to: "addr[bit / 64] = 0;".
Fixes: 27fcec6c27ca ("bcachefs: Clear recovery_passes_required as they complete without errors")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull out eytzinger.c and kill eytzinger_cmp_fn. We now provide
eytzinger0_sort and eytzinger0_sort_r, which use the standard cmp_func_t
and cmp_r_func_t callbacks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
prep work for lifting out of fs/bcachefs/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now that we've got the errors_silent mechanism, we don't have to check
if the reconstruct_alloc option is set all over the place.
Also - users no longer have to explicitly select fsck and fix_errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
we've got some helpers that return errors sanely, move them to a more
common location for use in fs-ioctl.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fixes: e6a2566f7a00 ("bcachefs: Better journal tracepoints")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: smatch
|
|
Also print out the data_opts, so that we can see what specifically is
being done to an extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
not portable to userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We now include backtraces for every thread involved in the cycle.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Bit of cleanup & modernization: also moving this code to util.c, it'll
be used by userspace as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This introduces a new helper for connecting time_stats to state changes,
i.e. when taking journal reservations is blocked for some reason.
We use this to track separately the different reasons the journal might
be blocked - i.e. space in the journal full, or the journal pin fifo
full.
Also do some cleanup and improvements on the time stats code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
similar to prt_bitflags(), but for ulong arrays
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Improved, better named version of pr_time().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
subvolume.c has gotten a bit large, this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This improves copygc pipelining across multiple buckets: we now track
each in flight bucket we're evacuating, with separate moving_contexts.
This means that whereas previously we had to wait for outstanding moves
to complete to ensure we didn't try to evacuate the same bucket twice,
we can now just check buckets we want to evacuate against the pending
list.
This also mean we can run the verify_bucket_evacuated() check without
killing pipelining - meaning it can now always be enabled, not just on
debug builds.
This is going to be important for the upcoming erasure coding work,
where moving IOs that are being erasure coded will now skip the initial
replication step; instead the IOs will wait on the stripe to complete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now we also print the number of buckets reserved for each watermark.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
These were wrappers around atomic operations that verified that the
counter wasn't negative, but they're dead code - delete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Some lock operations can't fail; a cycle of nofail locks is impossible
to recover from. So we want to get rid of these nofail locking
operations, but as this is tricky it'll be done incrementally.
If such a cycle happens, this patch prints out which codepaths are
involved so we know what to work on next.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For debugging the eytzinger search tree code, and low level bkey packing
code, it can be helpful to see things in binary: this patch improves our
helpers for doing so.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Btree updates before we go RW work by inserting into the array of keys
that journal replay will insert - but inserting into a flat array is
O(n), meaning if btree_gc needs to update many alloc keys, we're O(n^2).
Fortunately, the updates btree_gc does happens in sequential order,
which means a gap buffer works nicely here - this patch implements a gap
buffer for journal keys.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
In a few places we were passing a variable to pr_buf() for the format
string - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
In the write path, after the write to the block device(s) complete we
have to punt to process context to do the btree update.
Instead of using the work item embedded in op->cl, this patch switches
to a per write-point work item. This helps with two different issues:
- lock contention: btree updates to the same writepoint will (usually)
be updating the same alloc keys
- context switch overhead: when we're bottlenecked on btree updates,
having a thread (running out of a work item) checking the write point
for completed ops is cheaper than queueing up a new work item and
waking up a kworker.
In an arbitrary benchmark, 4k random writes with fio running inside a
VM, this patch resulted in a 10% improvement in total iops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When deleting an entry from a heap that was at entry h->used - 1, we'd
end up calling heap_sift() on an entry outside the heap - the entry we
just removed - which would end up re-adding it to the heap and deleting
something we didn't want to delete. Oops...
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This tells the compiler to check printf format strings, and catches a
few bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|