summaryrefslogtreecommitdiff
path: root/fs/bcachefs/super.c
AgeCommit message (Collapse)Author
2023-10-22bcachefs: Add a kmem_cache for btree_key_cache objectsKent Overstreet
We allocate a lot of these, and we're seeing sporading OOMs - this will help with tracking those down. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: New varintsKent Overstreet
Previous varint implementation used by the inode code was not nearly as fast as it could have been; partly because it was attempting to encode integers up to 96 bits (for timestamps) but this meant that encoding and decoding the length required a table lookup. Instead, we'll just encode timestamps greater than 64 bits as two separate varints; this will make decoding/encoding of inodes significantly faster overall. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add a single slot percpu buf for btree itersKent Overstreet
Allocating our array of btree iters is a big enough allocation that it hits the buddy allocator, and we're seeing lots of lock contention. Sticking a single element buffer in front of it should help. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improved inode create optimizationKent Overstreet
This shards new inodes into different btree nodes by using the processor ID for the high bits of the new inode number. Much faster than the previous inode create optimization - this also helps with sharding in the other btrees that index by inode number. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Minor journal reclaim improvementKent Overstreet
With the btree key cache code, journal reclaim now has a lot more work to do. It could be the case that after journal reclaim has finished one iteration there's already more work to do, so put it in a loop to check for that. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Inode create optimizationKent Overstreet
On workloads that do a lot of multithreaded creates all at once, lock contention on the inodes btree turns out to still be an issue. This patch adds a small buffer of inode numbers that are known to be free, so that we can avoid touching the btree on every create. Also, this changes inode creates to update via the btree key cache for the initial create. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Start/stop io clock hands in read/write pathsKent Overstreet
This fixes a bug where the clock hands in the journal and superblock didn't match, because we were still incrementing the read clock hand while read-only. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improvements to writing alloc infoKent Overstreet
Now that we've got transactional alloc info updates (and have for awhile), we don't need to write it out on shutdown, and we don't need to write it out on startup except when GC found errors - this is a big improvement to mount/unmount performance. This patch also fixes a few bugs where we weren't writing out alloc info (on new filesystems, and new devices) and should have been. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix copygc dying on startupKent Overstreet
The copygc threads errors out and makes the filesystem go RO if it ever tries to run and discovers it has no reserve allocated - which is a problem if it races with the allocator thread and its reserve hasn't been filled yet. The allocator thread doesn't start filling the copygc reserve until after BCH_FS_STARTED has been set, so make sure to wake up the allocator threads after setting that and before starting copygc. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix errors early in the fs init processKent Overstreet
At some point bch2_fs_alloc() was changed to always call bch2_fs_free() in the error path, which means we need c->cl to always be initialized. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix unmount pathKent Overstreet
There was a long standing race in the mount/unmount code - the VFS intends for mount/unmount synchronizatino to be handled by the list of superblocks, but we were still holding devices open after tearing down our superblock in the unmount path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't fail mount if device has been removedKent Overstreet
Also - make sure to show the devices we actually have open in /proc Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix a bug with the journal_seq_blacklist mechanismKent Overstreet
Previously, we would start doing btree updates before writing the first journal entry; if this was after an unclean shutdown, this could cause those btree updates to not be blacklisted. Also, move some code to headers for userspace debug tools. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't let copygc buckets be stolen by other threadsKent Overstreet
And assorted other copygc fixes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Make copygc thread globalKent Overstreet
Per device copygc threads don't move data to different devices and they make fragmentation works - they don't make much sense anymore. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Use x-macros for data typesKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Move stripe creation to workqueueKent Overstreet
This is mainly to solve a lock ordering issue, and also simplifies the code a bit. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improve stripe triggers/heap codeKent Overstreet
Soon we'll be able to modify existing stripes - replacing empty blocks with new blocks and new p/q blocks. This patch updates the trigger code to handle pointers changing in an existing stripe; also, it significantly improves how the stripes heap works, which means we can get rid of the stripe creation/deletion lock. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Use cached iterators for alloc btreeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Btree key cacheKent Overstreet
This introduces a new kind of btree iterator, cached iterators, which point to keys cached in a hash table. The cache also acts as a write cache - in the update path, we journal the update but defer updating the btree until the cached entry is flushed by journal reclaim. Cache coherency is for now up to the users to handle, which isn't ideal but should be good enough for now. These new iterators will be used for updating inodes and alloc info (the alloc and stripes btrees). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Turn c->state_lock into an rwsemKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Kill old allocator startup codeKent Overstreet
It's not needed anymore since we can now write to buckets before updating the alloc btree. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fixes for going ROKent Overstreet
Now that interior btree updates are fully transactional, we don't need to write out alloc info in a loop. However, interior btree updates do put more things in the journal, so we still need a loop in the RO sequence. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Interior btree updates are now fully transactionalKent Overstreet
We now update the alloc info (bucket sector counts) atomically with journalling the update to the interior btree nodes, and we also set new btree roots atomically with the journalled part of the btree update. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Factor out bch2_fs_btree_interior_update_init()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix reading of alloc info after unclean shutdownKent Overstreet
When updates to interior nodes started being journalled, that meant that after an unclean shutdown, until journal replay is done we can't walk the btree without overlaying the updates from the journal. The initial btree gc was changed to walk the btree overlaying keys from the journal - but bch2_alloc_read() and bch2_stripes_read() were missed. Major whoops... Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Be more rigorous about marking the filesystem cleanKent Overstreet
Previously, there was at least one error path where we could mark the filesystem clean when we hadn't sucessfully written out alloc info. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fixes for startup on very full filesystemsKent Overstreet
- Always pass BTREE_INSERT_USE_RESERVE when writing alloc btree keys - Don't strand buckest on the copygc freelist until after recovery is done and we're starting copygc. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add an option for keeping journal entries after startupKent Overstreet
This will be used by the userspace debug tools. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Dont't del sysfs dir until after we go ROKent Overstreet
This will help for debugging hangs during unmount Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Seralize btree_update operations at btree_update_nodes_written()Kent Overstreet
Prep work for journalling updates to interior nodes - enforcing ordering will greatly simplify those changes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Hacky fixes for device removalKent Overstreet
The device remove test was sporadically failing, because we hadn't finished dropping btree sector counts for the device when bch2_replicas_gc2() was called - mainly due to in flight journal writes. We don't yet have a good mechanism for flushing the counts that correspend to open journal entries yet. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't export __bch2_fs_read_writeKent Overstreet
BTREE_INSERT_LAZY_RW was added for this since this code was written; use it instead. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Kill btree_node_iter_largeKent Overstreet
Long overdue cleanup - this converts btree_node_iter_large uses to sort_iter. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: kill ca->freelist_lockKent Overstreet
All uses were supposed to be switched over to c->freelist_lock Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: bkey_on_stackKent Overstreet
This implements code for storing small bkeys on the stack and allocating out of a mempool if they're too big. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Do updates in order they were queued up inKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix stripe_idx_to_delete()Kent Overstreet
There was a null ptr deref when there wasn't a stripes heap allocated Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't try to delete stripes when ROKent Overstreet
We weren't checking for errors when trying to delet stripes, which meant ec_stripe_delete_work() would spin trying to delete the same stripe over and over. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix return code from bch2_fs_start()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: some improvements to startup messages and optionsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Various improvements to bch2_alloc_write()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Write out fs usage consistentlyKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Avoid write lock on mark_lockKent Overstreet
mark_lock is a frequently taken lock, and there's also potential for deadlocks since currently bch2_clear_page_bits which is called from memory reclaim has to take it to drop disk reservations. The disk reservation get path takes it when it recalculates the number of sectors known to be available, but it's not really needed for consistency. We just want to make sure we only have one thread updating the sectors_available count, which we can do with a dedicated mutex. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Pass flags arg to bch2_alloc_write()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: allocate sb_read_scratch with __get_free_pageKent Overstreet
kmalloc allocations aren't guranteed alignment for io Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: disallow ever going rw if nochanges or noreplayKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Rewrite journal_seq_blacklist machineryKent Overstreet
Now, we store blacklisted journal sequence numbers in the superblock, not the journal: this helps to greatly simplify the code, and more importantly it's now implemented in a way that doesn't require all btree nodes to be visited before starting the journal - instead, we unconditionally blacklist the next 4 journal sequence numbers after an unclean shutdown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: allow journal reply on ro mountKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Track whether filesystem has errors in superblockKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>