summaryrefslogtreecommitdiff
path: root/fs/bcachefs/journal.h
AgeCommit message (Collapse)Author
2025-02-06bcachefs: fix deadlock in journal_entry_open()Jeongjun Park
In the previous commit b3d82c2f2761, code was added to prevent journal sequence overflow. Among them, the code added to journal_entry_open() uses the bch2_fs_fatal_err_on() function to handle errors. However, __journal_res_get() , which calls journal_entry_open() , calls journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on() internally tries to acquire journal->lock , which results in a deadlock. So we need to add a locked helper to handle fatal errors even when the journal->lock is held. Fixes: b3d82c2f2761 ("bcachefs: Guard against journal seq overflow") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25bcachefs: Improve journal pin flushingKent Overstreet
Running the preempt tiering tests with a lower than normal journal reclaim delay turned up a shutdown hang - a lost wakeup, caused because flushing a journal pin (e.g. key cache/write buffer) can generate a new journal pin. The "simple" fix of adding the correct wakeup didn't work because of ordering issues; if we flush btree node pins too aggressively before other pins have completed, we end up spinning where each flush iteration generates new work. So to fix this correctly: - The list of flushed journal pins is now broken out by type, so that we can wait for key cache/write buffer pin flushing to complete before flushing dirty btree nodes - A new closure_waitlist is added for bch2_journal_flush_pins; this one is only used under or when we're taking the journal lock, so it's pretty cheap to add rigorously correct wakeups to journal_pin_set() and journal_pin_drop(). Additionally, bch2_journal_seq_pins_to_text() is moved to journal_reclaim.c, where it belongs, along with a bit of other small renaming and refactoring. Besides fixing the hang, the better ordering between key cache/write buffer flushing and btree node flushing should help or fix the "unmount taking excessively long" a few users have been noticing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21bcachefs: "Journal stuck" timeout now takes into account device latencyKent Overstreet
If a block device (e.g. your typical consumer SSD) is taking multiple seconds for IOs (typically flushes), we don't want to emit the "journal stuck" message prematurely. Also, make sure to drop the btree_trans srcu lock if we're blocking for more than a second. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: bch2_journal_noflush_seq() now takes [start, end)Kent Overstreet
Harder to screw up if we're explicit about the range, and more correct as journal reservations can be outstanding on multiple journal entries simultaneously. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: errcode cleanup: journal errorsKent Overstreet
Instead of throwing standard error codes, we should be throwing dedicated private error codes, this greatly improves debugability. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: btree_write_buffer_flush_seq() no longer closes journalKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Can now block journal activity without closing cur entryKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20bcachefs: Don't use wait_event_interruptible() in recoveryKent Overstreet
Fix a bug where mount was failing with -ERESTARTSYS: https://github.com/koverstreet/bcachefs/issues/741 We only want the interruptible wait when called from fsync. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg()Uros Bizjak
Use try_cmpxchg() family of functions instead of cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28bcachefs: Don't use the new_fs() bucket alloc path on an initialized fsKent Overstreet
On a new filesystem or device we have to allocate the journal with a bump allocator, because allocation info isn't ready yet - but when hot-adding a device that doesn't have a journal, we don't want to use that path. Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08bcachefs: x-macroize journal flags enumsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08bcachefs: New assertion for writing to the journal after shutdownKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: better journal pipeliningKent Overstreet
Recently a severe performance regression was discovered, which bisected to a6548c8b5eb5 bcachefs: Avoid flushing the journal in the discard path It turns out the old behaviour, which issued excessive journal flushes, worked around a performance issue where queueing delays would cause the journal to not be able to write quickly enough and stall. The journal flushes masked the issue because they periodically flushed the device write cache, reducing write latency for non flushes. This patch reworks the journalling code to allow more than one (non-flush) write to be in flight at a time. With this patch, doing 4k random writes and an iodepth of 128, we are now able to hit 560k iops to a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01bcachefs: vstruct_for_each() now declares loop iterKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01bcachefs: btree write buffer now slurps keys from journalKent Overstreet
Previosuly, the transaction commit path would have to add keys to the btree write buffer as a separate operation, requiring additional global synchronization. This patch introduces a new journal entry type, which indicates that the keys need to be copied into the btree write buffer prior to being written out. We switch the journal entry type back to JSET_ENTRY_btree_keys prior to write, so this is not an on disk format change. Flushing the btree write buffer may require pulling keys out of journal entries yet to be written, and quiescing outstanding journal reservations; we previously added journal->buf_lock for synchronization with the journal write path. We also can't put strict bounds on the number of keys in the journal destined for the write buffer, which means we might overflow the size of the preallocated buffer and have to reallocate - this introduces a potentially fatal memory allocation failure. This is something we'll have to watch for, if it becomes an issue in practice we can do additional mitigation. The transaction commit path no longer has to explicitly check if the write buffer is full and wait on flushing; this is another performance optimization. Instead, when the btree write buffer is close to full we change the journal watermark, so that only reservations for journal reclaim are allowed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01bcachefs: kill journal->preres_waitKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-10bcachefs: Close journal entry if necessary when flushing all pinsKent Overstreet
Since outstanding journal buffers hold a journal pin, when flushing all pins we need to close the current journal entry if necessary so its pin can be released. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28bcachefs: move journal seq assertionKent Overstreet
journal_cur_seq() can legitimately be used outside of the journal lock, where this assert can race Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-14bcachefs: Kill journal pre-reservationsKent Overstreet
This deletes the complicated and somewhat expensive journal pre-reservation machinery in favor of just using journal watermarks: when the journal is more than half full, we run journal reclaim more aggressively, and when the journal is more than 3/4s full we only allow journal reclaim to get new journal reservations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31bcachefs: Ensure devices are always correctly initializedKent Overstreet
We can't mark device superblocks or allocate journal on a device that isn't online. That means we may need to do this on every mount, because we may have formatted a new filesystem and then done the first mount (bch2_fs_initialize()) in degraded mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: fix race between journal entry close and pin setBrian Foster
bcachefs freeze testing via fstests generic/390 occasionally reproduces the following BUG from bch2_fs_read_only(): BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); This indicates that one or more dirty key cache keys still exist after the attempt to flush and quiesce the fs. The sequence that leads to this problem actually occurs on unfreeze (ro->rw), and looks something like the following: - Task A begins a transaction commit and acquires journal_res for the current seq. This transaction intends to perform key cache insertion. - Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends up in journal_entry_want_write(), which closes the current journal entry and drops the reference to the pin list created on entry open. The pin put pops the front of the journal via fast reclaim since the reference count has dropped to 0. - Task A attempts to set the journal pin for the associated cached key, but bch2_journal_pin_set() skips the pin insert because the seq of the transaction reservation is behind the front of the pin list fifo. The end result is that the pin associated with the cached key is not added, which prevents a subsequent reclaim from processing the key and thus leaves it dangling at freeze time. The fundamental cause of this problem is that the front of the journal is allowed to pop before a transaction with outstanding reservation on the associated journal seq is able to add a pin. The count for the pin list associated with the seq drops to zero and is prematurely reclaimed as a result. The logical fix for this problem lies in how the journal buffer is managed in similar scenarios where the entry might have been closed before a transaction with outstanding reservations happens to be committed. When a journal entry is opened, the current sequence number is bumped, the associated pin list is initialized with a reference count of 1, and the journal buffer reference count is bumped (via journal_state_inc()). When a journal reservation is acquired, the reservation also acquires a reference on the associated buffer. If the journal entry is closed in the meantime, it drops both the pin and buffer references held by the open entry, but the buffer still has references held by outstanding reservation. After the associated transaction commits, the reservation release drops the associated buffer references and the buffer is written out once the reference count has dropped to zero. The fundamental problem here is that the lifecycle of the pin list reference held by an open journal entry is too short to cover the processing of transactions with outstanding reservations. The simplest way to address this is to expand the pin list reference to the lifecycle of the buffer vs. the shorter lifecycle of the open journal entry. This ensures the pin list for a seq with outstanding reservation cannot be popped and reclaimed before all outstanding reservations have been released, even if the associated journal entry has been closed for further reservations. Move the pin put from journal entry close to where final processing of the journal buffer occurs. Create a duplicate helper to cover the case where the caller doesn't already hold the journal lock. This allows generic/390 to pass reliably. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: prepare journal buf put to handle pin putBrian Foster
bcachefs freeze testing has uncovered some raciness between journal entry open/close and pin list reference count management. The details of the problem are described in a separate patch. In preparation for the associated fix, refactor the journal buffer put path a bit to allow it to eventually handle dropping the pin list reference currently held by an open journal entry. Retain the journal write dispatch helper since the closure code is inlined and we don't want to increase the amount of inline code in the transaction commit path, but rename the function to reflect the purpose of final processing of the journal buffer. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Kill JOURNAL_WATERMARKKent Overstreet
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Convert EAGAIN errors to private error codesKent Overstreet
More error code cleanup, for better error messages and debugability. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Log more messages in the journalKent Overstreet
This patch - Adds a mechanism for queuing up journal entries prior to the journal being started, which will be used for early journal log messages - Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now take format strings. bch2_fs_log_msg() can be used before or after the journal has been started, and will use the appropriate mechanism. - Deletes the now obsolete bch2_journal_log_msg() - And adds more log messages to the recovery path - messages for journal/filesystem started, journal entries being blacklisted, and journal replay starting/finishing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: More style fixesKent Overstreet
Fixes for various checkpatch errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Refactor journal entry addingKent Overstreet
This takes copying the payload out of bch2_journal_add_entry(), which means we can use it for journal_transaction_name() - also prep work for journalling overwrites. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Use a genradix for reading journal entriesKent Overstreet
Previously, the journal read path used a linked list for storing the journal entries we read from disk. But there's been a bug that's been causing journal_flush_delay to incorrectly be set to 0, leading to far more journal entries than is normal being written out, which then means filesystems are no longer able to start due to the O(n^2) behaviour of inserting into/searching that linked list. Fix this by switching to a radix tree. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Introduce a separate journal watermark for copygcKent Overstreet
Since journal reclaim -> btree key cache flushing may require the allocation of new btree nodes, it has an implicit dependency on copygc in order to make forward progress - so we should avoid blocking copygc unless the journal is really close to full. This introduces watermarks to replace our single MAY_GET_UNRESERVED bit in the journal, and adds a watermark for copygc and plumbs it through. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: bch2_journal_log_msg()Kent Overstreet
This adds bch2_journal_log_msg(), which just logs a message to the journal, and uses it to mark startup and when journal replay finishes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: __journal_entry_close() never failsKent Overstreet
Previous patch just moved responsibility for incrementing the journal sequence number and initializing the new journal entry from __journal_entry_close() to journal_entry_open(); this patch makes the analagous change for journal reservation state, incrementing the index into array of journal_bufs at open time. This means that __journal_entry_close() never fails to close an open journal entry, which is important for the next patch that will change our emergency shutdown behaviour. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Refactor journal code to not use unwritten_idxKent Overstreet
It makes the code more readable if we work off of sequence numbers, instead of direct indexes into the array of journal buffers. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Start moving debug info from sysfs to debugfsKent Overstreet
In sysfs, files can only output at most PAGE_SIZE. This is a problem for debug info that needs to list an arbitrary number of times, and because of this limit some of our debug info has been terser and harder to read than we'd like. This patch moves info about journal pins and cached btree nodes to debugfs, and greatly expands and improves the output we return. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Revert "Ensure journal doesn't get stuck in nochanges mode"Kent Overstreet
This patch was originally to work around the journal geting stuck in nochanges mode - but that was just a hack, we needed to fix the actual bug. It should be fixed now, so revert it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Fix for journal getting stuckKent Overstreet
The journal can get stuck if we need to get a journal reservation for something we have a pre-reservation for, but aren't able to reclaim space, or if the pin fifo is full - it's impractical to resize the pin fifo at runtime. Previously, we reserved 8 entries in the pin fifo for pre-reservations, but that seems small - we're seeing the journal occasionally get stuck. Let's reserve a quarter of it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: bch2_journal_noflush_seq()Kent Overstreet
Add bch2_journal_noflush_seq(), for telling the journal that entries before a given sequence number should not be flushes - to be used by an upcoming allocator optimization. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Kill journal buf bloom filterKent Overstreet
This was used for recording which inodes have been modified by in flight journal writes, but was broken and has been superceded. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Ensure journal doesn't get stuck in nochanges modeKent Overstreet
This tweaks the journal code to always act as if there's space available in nochanges mode, when we're not going to be doing any writes. This helps in recovering filesystems that won't mount because they need journal replay and the journal has gotten stuck. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22bcachefs: Fix for btree_gc repairing interior btree ptrsKent Overstreet
Using the normal transaction commit path to insert and journal updates to interior nodes hadn't been done before this repair code was written, not surprising that there was a bug. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Eliminate memory barrier from fast path of journal_preres_put()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Be more careful about JOURNAL_RES_GET_RESERVEDKent Overstreet
JOURNAL_RES_GET_RESERVED should only be used for updatse that need to be done to free up space in the journal. In particular, when we're flushing keys from the key cache, if we're flushing them out of order we shouldn't be using it, since we're using up our remaining space in the journal without dropping a pin that will let us make forward progress. With this patch, BTREE_INSERT_JOURNAL_RECLAIM without BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on journal reclaim if we're already in journal reclaim. This means we need to propagate these errors up to journal reclaim, indicating that flushing a journal pin should be retried in the future. This is prep work for a patch to change the way journal reclaim works, to split out flushing key cache keys because the btree key cache is too dirty from journal reclaim because we need space in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Correctly order flushes and journal writes on multi device filesystemsKent Overstreet
All writes prior to a journal write need to be flushed before the journal write itself happens. On single device filesystems, it suffices to mark the write with REQ_PREFLUSH|REQ_FUA, but on multi device filesystems we need to issue flushes to every device - and wait for them to complete - before issuing the journal writes. Previously, we were issuing flushes to every device, but we weren't waiting for them to complete before issuing the journal writes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Be more conservation about journal pre-reservationsKent Overstreet
- Try to always keep 1/8th of the journal free, on top of pre-reservations - Move the check for whether the journal is stuck to bch2_journal_space_available, and make it only fire when there aren't any journal writes in flight (that might free up space by updating last_seq) Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't require flush/fua on every journal writeKent Overstreet
This patch adds a flag to journal entries which, if set, indicates that they weren't done as flush/fua writes. - non flush/fua journal writes don't update last_seq (i.e. they don't free up space in the journal), thus the journal free space calculations now check whether nonflush journal writes are currently allowed (i.e. are we low on free space, or would doing a flush write free up a lot of space in the journal) - write_delay_ms, the user configurable option for when open journal entries are automatically written, is now interpreted as the max delay between flush journal writes (default 1 second). - bch2_journal_flush_seq_async is changed to ensure a flush write >= the requested sequence number has happened - journal read/replay must now ignore, and blacklist, any journal entries newer than the most recent flush entry in the journal. Also, the way the read_entire_journal option is handled has been improved; struct journal_replay now has an entry, 'ignore', for entries that were read but should not be used. - assorted refactoring and improvements related to journal read in journal_io.c and recovery.c Previously, we'd have to issue a flush/fua write every time we accumulated a full journal entry - typically the bucket size. Now we need to issue them much less frequently: when an fsync is requested, or it's been more than write_delay_ms since the last flush, or when we need to free up space in the journal. This is a significant performance improvement on many write heavy workloads. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Increase journal pipeliningKent Overstreet
This patch increases the maximum journal buffers in flight from 2 to 4 - this will be particularly helpful when in the future we stop requiring flush+fua for every journal write. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Assorted journal refactoringKent Overstreet
Improved the way we track various state by adding j->err_seq, which records the first journal sequence number that encountered an error being written, and j->last_empty_seq, which records the most recent journal entry that was completely empty. Also, use the low bits of the journal sequence number to index the corresponding journal_buf. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Delete dead journalling codeKent Overstreet
Usage of the journal has gotten somewhat simpler over time - neat. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix journal_seq_copy()Kent Overstreet
We also need to update the journal's bloom filter of inode numbers that each journal write has upudates for - in case the inode gets evicted before it gets fsynced. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Convert various code to printbufKent Overstreet
printbufs know how big the buffer is that was allocated, so we can get rid of the random PAGE_SIZEs all over the place. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Always give out journal pre-res if we already have oneKent Overstreet
This is better than skipping the journal pre-reservation if we already have one - we should still acount for the journal reservation we're going to have to get. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>