summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-10-22bcachefs: Don't write bucket IO time lazilyKent Overstreet
With the btree key cache code, we don't need to update the alloc btree lazily - and this will mean we can remove the bch2_alloc_write() call in the shutdown path. Future work: we really need to expend the bucket IO clocks from 16 to 64 bits, so that we don't have to rescale them. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add BCH_BKEY_PTRS_MAXKent Overstreet
This now means "the maximum number of pointers within a bkey" - and bch_devs_list is updated to use it instead of BCH_REPLICAS_MAX, since stripes can contain more than BCH_REPLICAS_MAX pointers. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Check for duplicate device ptrs in bch2_bkey_ptrs_invalid()Kent Overstreet
This is something we clearly should be checking for, but weren't - oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add some cond_rescheds() in shutdown pathKent Overstreet
Particularly on emergency shutdown we can end up having to clean up a lot of dirty cached btree keys here. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix btree node merge -> split operationsKent Overstreet
If a btree node merger is followed by a split or compact of the parent node, we could end up with the parent btree node iterator pointing to the whiteout inserted by the btree node merge operation - the fix is to ensure that interior btree node iterators always point to the first non whiteout. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Always check if we need disk res in extent update pathKent Overstreet
With erasure coding, we now have processes in the background that compact data, causing it to take up less space on disk than when it was written, or potentially when it was read. This means that we can't trust the page cache when it says "we have data on disk taking up x amount of space here" - there's always the potential to race with background compaction. To fix this, just check if we need to add to our disk reservation in the bch2_extent_update() path, in the transaction that will do the btree update. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Update transactional triggers interface to pass old & new keysKent Overstreet
This is needed to fix a bug where we're overflowing iterators within a btree transaction, because we're updating the stripes btree (to update block counts) and the stripes btree trigger is unnecessarily updating the alloc btree - it doesn't need to update the alloc btree when the pointers within a stripe aren't changing. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Only try to get existing stripe once in stripe create pathKent Overstreet
The stripe creation path was too state-machiney: it would always run the full state machine until it had succesfully created a new stripe. But if we tried to get and reuse an existing stripe after we'd already allocated some buckets, the buckets we'd allocated might have conflicted with the blocks in the existing stripe we need to keep - oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix __btree_iter_next() when all iters are in use_next() when all ↵Kent Overstreet
iters are in use Also, print out more information on btree transaction iterator overflow. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix rand_delete() testKent Overstreet
When we didn't find a key to delete we were getting a null ptr deref. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Try to print full btree error messageKent Overstreet
Metadata corruption bugs are hard to debug if we can't see exactly what went wrong - try to allocate a bigger buffer so we can print out everything we have. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Prevent journal reclaim from spinningKent Overstreet
Without checking if we actually flushed anything, journal reclaim could still go into an infinite loop while trying ot shut down. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix btree key cache dirty checksKent Overstreet
Had a type that meant we were triggering journal reclaim _much_ more aggressively than needed. Also, fix a potential integer overflow. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Be more conservation about journal pre-reservationsKent Overstreet
- Try to always keep 1/8th of the journal free, on top of pre-reservations - Move the check for whether the journal is stuck to bch2_journal_space_available, and make it only fire when there aren't any journal writes in flight (that might free up space by updating last_seq) Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't require flush/fua on every journal writeKent Overstreet
This patch adds a flag to journal entries which, if set, indicates that they weren't done as flush/fua writes. - non flush/fua journal writes don't update last_seq (i.e. they don't free up space in the journal), thus the journal free space calculations now check whether nonflush journal writes are currently allowed (i.e. are we low on free space, or would doing a flush write free up a lot of space in the journal) - write_delay_ms, the user configurable option for when open journal entries are automatically written, is now interpreted as the max delay between flush journal writes (default 1 second). - bch2_journal_flush_seq_async is changed to ensure a flush write >= the requested sequence number has happened - journal read/replay must now ignore, and blacklist, any journal entries newer than the most recent flush entry in the journal. Also, the way the read_entire_journal option is handled has been improved; struct journal_replay now has an entry, 'ignore', for entries that were read but should not be used. - assorted refactoring and improvements related to journal read in journal_io.c and recovery.c Previously, we'd have to issue a flush/fua write every time we accumulated a full journal entry - typically the bucket size. Now we need to issue them much less frequently: when an fsync is requested, or it's been more than write_delay_ms since the last flush, or when we need to free up space in the journal. This is a significant performance improvement on many write heavy workloads. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improve journal free space calculationsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Increase journal pipeliningKent Overstreet
This patch increases the maximum journal buffers in flight from 2 to 4 - this will be particularly helpful when in the future we stop requiring flush+fua for every journal write. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't issue btree writes that weren't journalledKent Overstreet
If we have an error in the btree interior update path that prevents us from journalling the update, we can't issue the corresponding btree node write - we didn't get a journal sequence number that would cause it to be ignored in recovery. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Check for errors in bch2_journal_reclaim()Kent Overstreet
If the journal is halted, journal reclaim won't necessarily be able to make any forward progress, and won't accomplish anything anyways - we should bail out so that we don't get stuck looping in reclaim when the caches are too dirty and we should be shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Flag inodes that had btree update errorsKent Overstreet
On write error, the vfs inode's i_size may be inconsistent with the btree inode's i_size - flag this so we don't have spurious assertions. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improve some IO error messagesKent Overstreet
it's useful to know whether an error was for a read or a write - this also standardizes error messages a bit more. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Refactor filesystem usage accountingKent Overstreet
Various filesystem usage counters are kept in percpu counters, with one set per in flight journal buffer. Right now all the code that deals with it assumes that there's only two buffers/sets of counters, but the number of journal bufs is getting increased to 4 in the next patch - so refactor that code to not assume a constant. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix spurious alloc errors on forced shutdownKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix some spurious gcc warningsKent Overstreet
These only come up when building in userspace, for some reason. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix journal_flush_seq()Kent Overstreet
The error check was inverted - leading fsyncs to get stuck and hang, oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: bch2_trans_get_iter() no longer returns errorsKent Overstreet
Since we now always preallocate the maximum number of iterators when we initialize a btree transaction, getting an iterator never fails - we can delete a fair amount of error path code. This patch also simplifies the iterator allocation code a bit. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add error handling to unit & perf testsKent Overstreet
This way, these tests can be used with tests that inject IO errors and shut down the filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Journal pin refactoringKent Overstreet
This deletes some duplicated code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix for fsck spuriously finding duplicate extentsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Use BTREE_ITER_PREFETCH in journal+btree iterKent Overstreet
Introducing the journal+btree iter introduced a regression where we stopped using BTREE_ITER_PREFETCH - this is a performance regression on rotating disks. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Ensure we always have a journal pin in interior update pathKent Overstreet
For the new nodes an interior btree update makes reachable, updates to those nodes may be journalled after the btree update starts but before the transactional part - where we make those nodes reachable. Those updates need to be kept in the journal until after the btree update completes, hence we should always get a journal pin at the start of the interior update. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Change a BUG_ON() to a fatal errorKent Overstreet
In the btree key cache code, failing to flush a dirty key is a serious error, but it doesn't need to be a BUG_ON(), we can stop the filesystem instead. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix error in filesystem initializationKent Overstreet
The rhashtable code doesn't like when we destroy an rhashtable that was never initialized Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix journal reclaim spinning in recoveryKent Overstreet
We can't run journal reclaim until we've finished replaying updates to interior btree nodes - the check for this was in the wrong place though, leading to journal reclaim spinning before it was allowed to proceed. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix for __readahead_batch getting partial batchKent Overstreet
We were incorrectly ignoring the return value of __readahead_batch, leading to a null ptr deref in __bch2_page_state_create(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Optimize bch2_journal_flush_seq_async()Kent Overstreet
Avoid taking the journal lock if we don't have to. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Delete dead codeKent Overstreet
The interior btree node update path has changed, this is no longer needed. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: bch2_btree_delete_range_trans()Kent Overstreet
This helps reduce stack usage by avoiding multiple btree_trans on the stack. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't use bkey cache for inode update in fsckKent Overstreet
fsck doesn't know about the btree key cache, and non-cached iterators aren't cache coherent (yet?) Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix an rcu splatKent Overstreet
bch2_bucket_alloc() requires rcu_read_lock() to be held. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Move journal reclaim to a kthreadKent Overstreet
This is to make tracing easier. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Throttle updates when btree key cache is too dirtyKent Overstreet
This is needed to ensure we don't deadlock because journal reclaim and thus memory reclaim isn't making forward progress. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Journal reclaim requires memalloc_noreclaim_save()Kent Overstreet
Memory reclaim requires journal reclaim to make forward progress - it's what cleans our caches - thus, while we're in journal reclaim or holding the journal reclaim lock we can't recurse into memory reclaim. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Simplify transaction commit error pathKent Overstreet
The transaction restart path traverses all iterators, we don't need to do it here. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Ensure journal reclaim runs when btree key cache is too dirtyKent Overstreet
Ensuring the key cache isn't too dirty is critical for ensuring that the shrinker can reclaim memory. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improve btree key cache shrinkerKent Overstreet
The shrinker should start scanning for entries that can be freed oldest to newest - this way, we can avoid scanning a lot of entries that are too new to be freed. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: More debug code improvementsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add a kmem_cache for btree_key_cache objectsKent Overstreet
We allocate a lot of these, and we're seeing sporading OOMs - this will help with tracking those down. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Be more precise with journal error reportingKent Overstreet
We were incorrectly detecting a journal deadlock - the journal filling up - when only the journal pin fifo had filled up; if the journal pin fifo is full that just means we need to wait on reclaim. This plumbs through better error reporting so we can better discriminate in the journal_res_get path what's going on. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add btree cache stats to sysfsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>