Age | Commit message (Collapse) | Author |
|
The journal_keys array can't be substantially modified after we go RW,
because lookups need to be able to check it locklessly - thus we're
limited on what we can do when a key in the journal has been
overwritten.
This is a problem when there's many overwrites to skip over for peek()
operations. To fix this, add tracking of ranges of overwrites: we create
a range entry when there's more than one contiguous whiteout.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
prefetching
To help ameloriate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.
Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
There's an unavoidable issue with btree lookups when we're overlaying
journal keys and the journal has many deletions for keys present in the
btree - peek operations will have to iterate over all those deletions to
find the next live key to return.
This is mainly a problem for lookups in interior nodes, if we have to
traverse to a leaf. Looking up an insert position in a leaf (for journal
replay) doesn't have to find the next live key, but walking down the
btree does.
So to ameloriate this, change journal key sort ordering so that we
replay keys from roots and interior nodes first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A user contributed a filessytem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().
When itearting in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration and it can't span inode numbers;
continuing into the next inode might mean we start seeing keys from a
different snapshot tree, that the is_ancestor() checks always filter,
thus we're never able to return a key and stop iterating.
Backwards iteration never implemented the end position because nothing
else needed it - except for reconstuct_inode().
Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We'll be introducing btree_iter_peek_prev_min(), so rename for
consistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a performance regression in journal replay; without
colaescing accounting keys we have multiple keys at the same position,
which means journal_keys_peek_upto() has to skip past many overwritten
keys - turning journal replay into an O(n^2) algorithm.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.
Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.
That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.
There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.
To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
debug helper
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't pick a pivot that's going to be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.
lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The btree iterator code overlays keys from the journal until journal
replay is finished; since we're now starting copygc/rebalance etc.
before replay is finished, this is multithreaded access and thus needs
refcounting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|