summaryrefslogtreecommitdiff
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)Author
2023-12-15btrfs: migrate btrfs_repair_io_failure() to folio interfacesQu Wenruo
[BUG] Test case btrfs/124 failed if larger metadata folio is enabled, the dying message looks like this: BTRFS error (device dm-2): bad tree block start, mirror 2 want 31686656 have 0 BTRFS info (device dm-2): read error corrected: ino 0 off 31686656 (dev /dev/mapper/test-scratch2 sector 20928) BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 6 PID: 350881 Comm: btrfs Tainted: G OE 6.7.0-rc3-custom+ #128 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:btrfs_read_extent_buffer+0x106/0x180 [btrfs] PKRU: 55555554 Call Trace: <TASK> read_tree_block+0x33/0xb0 [btrfs] read_block_for_search+0x23e/0x340 [btrfs] btrfs_search_slot+0x2f9/0xe60 [btrfs] btrfs_lookup_csum+0x75/0x160 [btrfs] btrfs_lookup_bio_sums+0x21a/0x560 [btrfs] btrfs_submit_chunk+0x152/0x680 [btrfs] btrfs_submit_bio+0x1c/0x50 [btrfs] submit_one_bio+0x40/0x80 [btrfs] submit_extent_page+0x158/0x390 [btrfs] btrfs_do_readpage+0x330/0x740 [btrfs] extent_readahead+0x38d/0x6c0 [btrfs] read_pages+0x94/0x2c0 page_cache_ra_unbounded+0x12d/0x190 relocate_file_extent_cluster+0x7c1/0x9d0 [btrfs] relocate_block_group+0x2d3/0x560 [btrfs] btrfs_relocate_block_group+0x2c7/0x4b0 [btrfs] btrfs_relocate_chunk+0x4c/0x1a0 [btrfs] btrfs_balance+0x925/0x13c0 [btrfs] btrfs_ioctl+0x19f1/0x25d0 [btrfs] __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x3f/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 [CAUSE] The dying line is at btrfs_repair_io_failure() call inside btrfs_repair_eb_io_failure(). The function is still relying on the extent buffer using page sized folios. When the extent buffer is using larger folio, we go into the 2nd slot of folios[], and triggered the NULL pointer dereference. [FIX] Migrate btrfs_repair_io_failure() to folio interfaces. So that when we hit a larger folio, we just submit the whole folio in one go. This also affects data repair path through btrfs_end_repair_bio(), thankfully data is still fully page based, we can just add an ASSERT(), and use page_folio() to convert the page to folio. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate subpage code to folio interfacesQu Wenruo
Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to foliosQu Wenruo
These two functions are still using the old page based code, which is not going to handle larger folios at all. The migration itself is going to involve the following changes: - PAGE_SIZE -> folio_size() - PAGE_SHIFT -> folio_shift() - get_eb_page_index() -> get_eb_folio_index() - get_eb_offset_in_page() -> get_eb_offset_in_folio() And since we're going to support larger folios, although above straight conversion is good enough, this patch would add extra comments in the involved functions to explain why the same single line code can now cover 3 cases: - folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE The common, non-subpage case with per-page folio. - folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE The incoming larger folio, non-subpage case. - folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE The existing subpage case, we won't larger folio anyway. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: cleanup metadata page pointer usageQu Wenruo
Although we have migrated extent_buffer::pages[] to folios[], we're still mostly using the folio_page() help to grab the page. This patch would do the following cleanups for metadata: - Introduce num_extent_folios() helper This is to replace most num_extent_pages() callers. - Use num_extent_folios() to iterate future large folios This allows us to use things like bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags for the folio (aka the leading/tailing page), which reduces the loop iteration to 1 for large folios. - Change metadata related functions to use folio pointers Including their function name, involving: * attach_extent_buffer_page() * detach_extent_buffer_page() * page_range_has_eb() * btrfs_release_extent_buffer_pages() * btree_clear_page_dirty() * btrfs_page_inc_eb_refs() * btrfs_page_dec_eb_refs() - Change btrfs_is_subpage() to accept an address_space pointer This is to allow both page->mapping and folio->mapping to be utilized. As data is still using the old per-page code, and may keep so for a while. - Special corner case place holder for future order mismatches between extent buffer and inode filemap For now it's just a block of comments and a dead ASSERT(), no real handling yet. The subpage code would still go page, just because subpage and large folio are conflicting conditions, thus we don't need to bother subpage with higher order folios at all. Just folio_page(folio, 0) would be enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor styling tweaks ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate extent_buffer::pages[] to folioQu Wenruo
For now extent_buffer::pages[] are still only accepting single page pointer, thus we can migrate to folios pretty easily. As for single page, page and folio are 1:1 mapped, including their page flags. This patch would just do the conversion from struct page to struct folio, providing the first step to higher order folio in the future. This conversion is pretty simple: - extent_buffer::pages[] -> extent_buffer::folios[] - page_address(eb->pages[i]) -> folio_address(eb->pages[i]) - eb->pages[i] -> folio_page(eb->folios[i], 0) There would be more specific cleanups preparing for the incoming higher order folio support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-treeDavid Sterba
The radix-tree has been superseded by the xarray (https://lwn.net/Articles/745073), this patch converts the btrfs_root::delayed_nodes, the APIs are used in a simple way. First idea is to do xa_insert() but this would require GFP_ATOMIC allocation which we want to avoid if possible. The preload mechanism of radix-tree can be emulated within the xarray API. - xa_reserve() with GFP_NOFS outside of the lock, the reserved entry is inserted atomically at most once - xa_store() under a lock, in case something races in we can detect that and xa_load() returns a valid pointer All uses of xa_load() must check for a valid pointer in case they manage to get between the xa_reserve() and xa_store(), this is handled in btrfs_get_delayed_node(). Otherwise the functionality is equivalent, xarray implements the radix-tree and there should be no performance difference. The patch continues the efforts started in 253bf57555e451 ("btrfs: turn delayed_nodes_tree into an XArray") and fixes the problems with locking and GFP flags 088aea3b97e0ae ("Revert "btrfs: turn delayed_nodes_tree into an XArray""). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: set clear_cache if we use usebackuprootJosef Bacik
We're currently setting this when we try to load the roots and we see that usebackuproot is set. Instead set this at mount option parsing time. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move one shot mount option clearing to super.cJosef Bacik
There's no reason this has to happen in open_ctree, and in fact in the old mount API we had to call this from remount. Move this to super.c, unexport it, and call it from both mount and reconfigure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move the device specific mount options to super.cJosef Bacik
We add these mount options based on the fs_devices settings, which can be set once we've opened the fs_devices. Move these into their own helper and call it from get_tree_super. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: switch to the new mount APIJosef Bacik
Now that we have all of the parts in place to use the new mount API, switch our fs_type to use the new callbacks. There are a few things that have to be done at the same time because of the order of operations changes that come along with the new mount API. These must be done in the same patch otherwise things will go wrong. 1. Export and use btrfs_check_options in open_ctree(). This is because the options are done ahead of time, and we need to check them once we have the feature flags loaded. 2. Update the free space cache settings. Since we're coming in with the options already set we need to make sure we don't undo what the user has asked for. 3. Set our sb_flags at init_fs_context time, the fs_context stuff is trying to manage the sb_flagss itself, so move that into init_fs_context and out of the fill super part. Additionally I've marked the unused functions with __maybe_unused and will remove them in a future patch. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add a NOSPACECACHE mount option flagJosef Bacik
With the old mount API we'd pre-populate the mount options with the space cache settings of the file system, and then the user toggled them on or off with the mount options. When we switch to the new mount API the mount options will be set before we get into opening the file system, so we need a flag to indicate that the user explicitly asked for -o nospace_cache so we can make the appropriate changes after the fact. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: do not allow free space tree rebuild on extent tree v2Josef Bacik
We currently don't allow these options to be set if we're extent tree v2 via the mount option parsing. However when we switch to the new mount API we'll no longer have the super block loaded, so won't be able to make this distinction at mount option parsing time. Address this by checking for extent tree v2 at the point where we make the decision to rebuild the free space tree. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move space cache settings into open_ctreeJosef Bacik
Currently we pre-load the space cache settings in btrfs_parse_options, however when we switch to the new mount API the mount option parsing will happen before we have the super block loaded. Add a helper to set the appropriate options based on the fs settings, this will allow us to have consistent free space cache settings. This also folds in the space cache related decisions we make for subpage sectorsize support, so all of this is done in one place. Since this was being called by parse options it looks like we're changing the behavior of remount, but in fact we aren't. The pre-loading of the free space cache settings is done because we want to handle the case of users not using any space_cache options, we'll derive the appropriate mount option based on the on disk state. On remount this wouldn't reset anything as we'll have cleared the v1 cache generation if we mounted -o nospace_cache. Similarly it's impossible to turn off the free space tree without specifically saying -o nospace_cache,clear_cache, which will delete the free space tree and clear the compat_ro option. Again in this case calling this code in remount wouldn't result in any change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: set default compress type at btrfs_init_fs_info timeJosef Bacik
With the new mount API we'll be setting our compression well before we call open_ctree. We don't want to overwrite our settings, so set the default in btrfs_init_fs_info instead of open_ctree. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: allow extent buffer helpers to skip cross-page handlingQu Wenruo
Currently btrfs extent buffer helpers are doing all the cross-page handling, as there is no guarantee that all those eb pages are contiguous. However on systems with enough memory, there is a very high chance the page cache for btree_inode are allocated with physically contiguous pages. In that case, we can skip all the complex cross-page handling, thus speeding up the code. This patch adds a new member, extent_buffer::addr, which is only set to non-NULL if all the extent buffer pages are physically contiguous. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: zoned: don't clear dirty flag of extent bufferJohannes Thumshirn
One a zoned filesystem, never clear the dirty flag of an extent buffer, but instead mark it as zeroout. On writeout, when encountering a marked extent_buffer, zero it out. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: rename EXTENT_BUFFER_NO_CHECK to EXTENT_BUFFER_ZONED_ZEROOUTJohannes Thumshirn
EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer, namely it is written as all zeros. This is needed in zoned mode, to preserve I/O ordering. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use a dedicated data structure for chunk mapsFilipe Manana
Currently we abuse the extent_map structure for two purposes: 1) To actually represent extents for inodes; 2) To represent chunk mappings. This is odd and has several disadvantages: 1) To create a chunk map, we need to do two memory allocations: one for an extent_map structure and another one for a map_lookup structure, so more potential for an allocation failure and more complicated code to manage and link two structures; 2) For a chunk map we actually only use 3 fields (24 bytes) of the respective extent map structure: the 'start' field to have the logical start address of the chunk, the 'len' field to have the chunk's size, and the 'orig_block_len' field to contain the chunk's stripe size. Besides wasting a memory, it's also odd and not intuitive at all to have the stripe size in a field named 'orig_block_len'. We are also using 'block_len' of the extent_map structure to contain the chunk size, so we have 2 fields for the same value, 'len' and 'block_len', which is pointless; 3) When an extent map is associated to a chunk mapping, we set the bit EXTENT_FLAG_FS_MAPPING on its flags and then make its member named 'map_lookup' point to the associated map_lookup structure. This means that for an extent map associated to an inode extent, we are not using this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform); 4) Extent maps associated to a chunk mapping are never merged or split so it's pointless to use the existing extent map infrastructure. So add a dedicated data structure named 'btrfs_chunk_map' to represent chunk mappings, this is basically the existing map_lookup structure with some extra fields: 1) 'start' to contain the chunk logical address; 2) 'chunk_len' to contain the chunk's length; 3) 'stripe_size' for the stripe size; 4) 'rb_node' for insertion into a rb tree; 5) 'refs' for reference counting. This way we do a single memory allocation for chunk mappings and we don't waste memory for them with unused/unnecessary fields from an extent_map. We also save 8 bytes from the extent_map structure by removing the 'map_lookup' pointer, so the size of struct extent_map is reduced from 144 bytes down to 136 bytes, and we can now have 30 extents map per 4K page instead of 28. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove log_extents_lock and logged_list from struct btrfs_rootFilipe Manana
The logged_list[2] and log_extents_lock[2] members of struct btrfs_root are no longer used, their last use was removed in commit 5636cf7d6dc8 ("btrfs: remove the logged extents infrastructure"). So remove these fields. This reduces the size of struct btrfs_root, on a release kernel, from 1392 bytes down to 1352 bytes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: free qgroup pertrans reserve on transaction abortBoris Burkov
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-23btrfs: add dmesg output for first mount and last unmount of a filesystemQu Wenruo
There is a feature request to add dmesg output when unmounting a btrfs. There are several alternative methods to do the same thing, but with their own problems: - Use eBPF to watch btrfs_put_super()/open_ctree() Not end user friendly, they have to dip their head into the source code. - Watch for directory /sys/fs/<uuid>/ This is way more simple, but still requires some simple device -> uuid lookups. And a script needs to use inotify to watch /sys/fs/. Compared to all these, directly outputting the information into dmesg would be the most simple one, with both device and UUID included. And since we're here, also add the output when mounting a filesystem for the first time for parity. A more fine grained monitoring of subvolume mounts should be done by another layer, like audit. Now mounting a btrfs with all default mkfs options would look like this: [81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 [81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm [81.908258] BTRFS info (device dm-8): using free space tree [81.912644] BTRFS info (device dm-8): auto enabling async discard [81.913277] BTRFS info (device dm-8): checking UUID tree [91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 CC: stable@vger.kernel.org # 5.4+ Link: https://github.com/kdave/btrfs-progs/issues/689 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing last_trans_committedFilipe Manana
Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field. In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing log_transidFilipe Manana
Currently the log_transid field of a root is always modified while holding the root's log_mutex locked. Most readers of a root's log_transid are also holding the root's log_mutex locked, however there is one exception which is btrfs_set_inode_last_trans() where we don't take the lock to avoid blocking several operations if log syncing is happening in parallel. Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing last_log_commitFilipe Manana
Currently, the last_log_commit of a root can be accessed concurrently without any lock protection. Readers can be calling btrfs_inode_in_log() early in a fsync call, which reads a root's last_log_commit, while a writer can change the last_log_commit while a log tree if being synced, at btrfs_sync_log(). Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_log_commit field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: support cloned-device mount capabilityAnand Jain
Guilherme's previous work [1] aimed at the mounting of cloned devices using a superblock flag SINGLE_DEV during mkfs. [1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/ Building upon this work, here is in memory only approach. As it mounts we determine if the same fsid is already mounted if then we generate a random temp fsid which shall be used the mount, in memory only not written to the disk. We distinguish devices by devt. Example: $ fallocate -l 300m ./disk1.img $ mkfs.btrfs -f ./disk1.img $ cp ./disk1.img ./disk2.img $ cp ./disk1.img ./disk3.img $ mount -o loop ./disk1.img /btrfs $ mount -o ./disk2.img /btrfs1 $ mount -o ./disk3.img /btrfs2 $ btrfs fi show -m Label: none uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02 Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop0 Label: none uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16 Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop1 Label: none uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop2 Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop reserving excessive space for block group item updatesFilipe Manana
Space for block group item updates, necessary after allocating or deallocating an extent from a block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes(). That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_metadata_size(), since we only need to update an existing block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_metadata_size() we will need to reserve 4 times less space when using the free space tree and 2 times less space when not using it, putting less pressure on space reservation. So use helpers to reserve and release space for block group item updates that use btrfs_calc_metadata_size() for calculation of the space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reject devices with CHANGING_FSID_V2Anand Jain
The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state where the device in the userspace btrfstune -m|-M operation failed to complete changing the fsid. This flag makes the kernel to automatically determine the other partner devices to which a given device can be associated, based on the fsid, metadata_uuid and generation values. btrfstune -m|M feature is especially useful in virtual cloud setups, where compute instances (disk images) are quickly copied, fsid changed, and launched. Given numerous disk images with the same metadata_uuid but different fsid, there's no clear way a device can be correctly assembled with the proper partners when the CHANGING_FSID_V2 flag is set. So, the disk could be assembled incorrectly, as in the example below: Before this patch: Consider the following two filesystems: /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrsftune -m operation fails. In this scenario, as the /dev/loop0's fsid change is interrupted, and the CHANGING_FSID_V2 flag is set as shown below. $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags" $ btrfs inspect dump-super /dev/loop0 | egrep '$p' superblock: bytenr=65536, device=/dev/loop0 flags 0x1000000001 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 9 num_devices 2 incompat_flags 0x741 dev_item.devid 1 $ btrfs inspect dump-super /dev/loop1 | egrep '$p' superblock: bytenr=65536, device=/dev/loop1 flags 0x1 fsid 11d2af4d-1b71-45a9-83f6-f2100766939d metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 10 num_devices 2 incompat_flags 0x741 dev_item.devid 2 $ btrfs inspect dump-super /dev/loop2 | egrep '$p' superblock: bytenr=65536, device=/dev/loop2 flags 0x1 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 8 num_devices 2 incompat_flags 0x741 dev_item.devid 1 $ btrfs inspect dump-super /dev/loop3 | egrep '$p' superblock: bytenr=65536, device=/dev/loop3 flags 0x1 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 8 num_devices 2 incompat_flags 0x741 dev_item.devid 2 It is normal that some devices aren't instantly discovered during system boot or iSCSI discovery. The controlled scan below demonstrates this. $ btrfs device scan --forget $ btrfs device scan /dev/loop0 Scanning for btrfs filesystems on '/dev/loop0' $ mount /dev/loop3 /btrfs $ btrfs filesystem show -m Label: none uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 Total devices 2 FS bytes used 144.00KiB devid 1 size 300.00MiB used 48.00MiB path /dev/loop0 devid 2 size 300.00MiB used 40.00MiB path /dev/loop3 /dev/loop0 and /dev/loop3 are incorrectly partnered. This kernel patch removes functions and code connected to the CHANGING_FSID_V2 flag. With this patch, now devices with the CHANGING_FSID_V2 flag are rejected. And its partner will fail to mount with the extra -o degraded option. The check is removed from open_ctree(), devices are rejected during scanning which in turn fails the mount. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: only set QUOTA_ENABLED when done reading qgroupsBoris Burkov
In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a quota_root, as opposed to after we are done setting up the qgroup structures. In the quota_enable path, we wait until after the structures are set up. Likewise, in disable, we clear the bit before tearing down the structures. I feel that this organization is less surprising for the open_ctree path. I don't believe this fixes any actual bug, but avoids potential confusion when using btrfs_qgroup_mode in an intermediate state where we are enabled but haven't yet setup the qgroup status flags. It also avoids any risk of calling a qgroup function and attempting to use the qgroup rbtrees before they exist/are setup. This all occurs before we do rw setup, so I believe it should be mostly a no-op. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: track metadata relocation COW with simple quotaBoris Burkov
Relocation COWs metadata blocks in two cases for the reloc root: - copying the subvolume root item when creating the reloc root - copying a btree node when there is a COW during relocation In both cases, the resulting btree node hits an abnormal code path with respect to the owner field in its btrfs_header. It first creates the root item for the new objectid, which populates the reloc root id, and it at this point that delayed refs are created. Later, it fully copies the old node into the new node (including the original owner field) which overwrites it. This results in a simple quotas mismatch where we run the delayed ref for the reloc root which has no simple quota effect (reloc root is not an fstree) but when we ultimately delete the node, the owner is the real original fstree and we do free the space. To work around this without tampering with the behavior of relocation, add a parameter to btrfs_add_tree_block that lets the relocation code path specify a different owning root than the "operating" root (in this case, owning root is the real root and the operating root is the reloc root). These can naturally be plumbed into delayed refs that have the same concept. Note that this is a double count in some sense, but a relatively natural one, as there are really two extents, and the old one will be deleted soon. This is consistent with how data relocation extents are accounted by simple quotas. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: read raid stripe tree from diskJohannes Thumshirn
If we find the raid-stripe-tree on mount, read it from disk. This is a backward incompatible feature. The rescue=ignorebadroots mount option will skip this tree. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: mark transaction id check as unlikely at btrfs_mark_buffer_dirty()Filipe Manana
At btrfs_mark_buffer_dirty(), having a transaction id mismatch is never expected to happen and it usually means there's a bug or some memory corruption due to a bitflip for example. So mark the condition as unlikely to optimize code generation as well as to make it obvious for human readers that it is a very unexpected condition. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: use btrfs_crit at btrfs_mark_buffer_dirty()Filipe Manana
There's no need to use WARN() at btrfs_mark_buffer_dirty() to print an error message, as we have the fs_info pointer we can use btrfs_crit() which prints device information and makes the message have a more uniform format. As we are already aborting the transaction we already have a stack trace printed as well. So replace the use of WARN() with btrfs_crit(). Also slightly reword the message to use 'logical' instead of 'block' as it's what is used in other error/warning messages. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop doing excessive space reservation for csum deletionFilipe Manana
Currently when reserving space for deleting the csum items for a data extent, when adding or updating a delayed ref head, we determine how many leaves of csum items we can have and then pass that number to the helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating space for all tree modifications we need when running delayed references, however the amount of space it computes is excessive for deleting csum items because: 1) It uses btrfs_calc_insert_metadata_size() which is excessive because we only need to delete csum items from the csum tree, we don't need to insert any items, so btrfs_calc_metadata_size() is all we need (as it computes space needed to delete an item); 2) If the free space tree is enabled, it doubles the amount of space, which is pointless for csum deletion since we don't need to touch the free space tree or any other tree other than the csum tree. So improve on this by tracking how many csum deletions we have and using a new helper to calculate space for csum deletions (just a wrapper around btrfs_calc_metadata_size() with a comment). This reduces the amount of space we need to reserve for csum deletions by a factor of 4, and it helps reduce the number of times we have to block space reservations and have the reclaim task enter the space flushing algorithm (flush delayed items, flush delayed refs, etc) in order to satisfy tickets. For example this results in a total time decrease when unlinking (or truncating) files with many extents, as we end up having to block on space metadata reservations less often. Example test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/test umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 100G gives at least 983040 extents with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar # Flush all delalloc and clear all metadata from memory. umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) rm -f $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "rm took $dur milliseconds" umount $MNT Before this change rm took: 7504 milliseconds After this change rm took: 6574 milliseconds (-12.4%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reserve space for delayed refs on a per ref basisFilipe Manana
Currently when reserving space for delayed refs we do it on a per ref head basis. This is generally enough because most back refs for an extent end up being inlined in the extent item - with the default leaf size of 16K we can have at most 33 inline back refs (this is calculated by the macro BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref head is given by btrfs_calc_delayed_ref_bytes(), which basically corresponds to a single path for insertion into the extent tree plus another path for insertion into the free space tree if it's enabled. However if we have reached the limit of inline refs or we have a mix of inline and non-inline refs, then we will need to insert a non-inline ref and update the existing extent item to update the total number of references for the extent. This implies we need reserved space for two insertion paths in the extent tree, but we only reserved for one path. The extent item and the non-inline ref item may be located in different leaves, or even if they are located in the same leaf, after updating the extent item and before inserting the non-inline ref item, the extent buffers in the btree path may have been written (due to memory pressure for e.g.), in which case we need to COW the entire path again. In this case since we have not reserved enough space for the delayed refs block reserve, we will use the global block reserve. If we are in a situation where the fs has no more unallocated space enough to allocate a new metadata block group and available space in the existing metadata block groups is close to the maximum size of the global block reserve (512M), we may end up consuming too much of the free metadata space to the point where we can't commit any future transaction because it will fail, with -ENOSPC, during its commit when trying to allocate an extent for some COW operation (running delayed refs generated by running delayed refs or COWing the root tree's root node at commit_cowonly_roots() for example). Such dramatic scenario can happen if we have many delayed refs that require the insertion of non-inline ref items, due to too many reflinks or snapshots. We also have situations where we use the global block reserve because we could not in advance know that we will need space to update some trees (block group creation for example), so this all adds up to increase the chances of exhausting the global block reserve and making any future transaction commit to fail with -ENOSPC and turn the fs into RO mode, or fail the mount operation in case the mount needs to start and commit a transaction, such as when we have orphans to cleanup for example - such case was reported and hit by someone running a SLE (SUSE Linux Enterprise) distribution for example - where the fs had no more unallocated space that could be used to allocate a new metadata block group, and the available metadata space was about 1.5M, not enough to commit a transaction to cleanup an orphan inode (or do relocation of data block groups that were far from being full). So reserve space for delayed refs by individual refs and not by ref heads, as we may need to COW multiple extent tree paths due to non-inline ref items. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY optionQu Wenruo
Since all check-integrity entry points have been removed, let's also remove the config and all related code relying on that. And since we have removed the mount option for check-integrity, we also need to re-number all the BTRFS_MOUNT_* enums. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: check-integrity: remove btrfsic_unmount() functionQu Wenruo
The function btrfsic_mount() is part of the deprecated check-integrity functionality. Now let's remove the main entry point of check-integrity, and thankfully most of the check-integrity code is self-contained inside check-integrity.c, we can safely remove the function without huge changes to btrfs code base. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: check-integrity: remove btrfsic_mount() functionQu Wenruo
The function btrfsic_mount() is part of the deprecated check-integrity functionality. Now let's remove the main entry point of check-integrity, and thankfully most of the check-integrity code is self-contained inside check-integrity.c, we can safely remove the function without huge changes to btrfs code base. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: check-integrity: remove btrfsic_check_bio() functionQu Wenruo
The function btrfsic_check_bio() is part of the deprecated check-integrity functionality. Now let's remove the main entry point of check-integrity, and thankfully most of the check-integrity code is self-contained inside check-integrity.c, we can safely remove the function without huge changes to btrfs code base. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: sipmlify uuid parameters of alloc_fs_devices()Anand Jain
Among all the callers, only the device_list_add() function uses the second argument of alloc_fs_devices(). It passes metadata_uuid when available, otherwise, it passes NULL. And in turn, alloc_fs_devices() is designed to copy either metadata_uuid or fsid into fs_devices::metadata_uuid. So remove the second argument in alloc_fs_devices(), and always copy the fsid. In the caller device_list_add() function, we will overwrite it with metadata_uuid when it is available. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folioQu Wenruo
[BUG] After commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap"), the DEBUG section of btree_dirty_folio() would no longer compile. [CAUSE] If DEBUG is defined, we would do extra checks for btree_dirty_folio(), mostly to make sure the range we marked dirty has an extent buffer and that extent buffer is dirty. For subpage, we need to iterate through all the extent buffers covered by that page range, and make sure they all matches the criteria. However commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap") changes how we store the bitmap, we pack all the 16 bits bitmaps into a larger bitmap, which would save some space. This means we no longer have btrfs_subpage::dirty_bitmap, instead the dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a length of btrfs_subpage_info::bitmap_nr_bits. [FIX] Although I'm not sure if it still makes sense to maintain such code, at least let it compile. This patch would let us test the bits one by one through the bitmaps. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: do not block starts waiting on previous transaction commitJosef Bacik
Internally I got a report of very long stalls on normal operations like creating a new file when auto relocation was running. The reporter used the 'bpf offcputime' tracer to show that we would get stuck in start_transaction for 5 to 30 seconds, and were always being woken up by the transaction commit. Using my timing-everything script, which times how long a function takes and what percentage of that total time is taken up by its children, I saw several traces like this 1083 took 32812902424 ns 29929002926 ns 91.2110% wait_for_commit_duration 25568 ns 7.7920e-05% commit_fs_roots_duration 1007751 ns 0.00307% commit_cowonly_roots_duration 446855602 ns 1.36182% btrfs_run_delayed_refs_duration 271980 ns 0.00082% btrfs_run_delayed_items_duration 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration 9656 ns 2.9427e-05% switch_commit_roots_duration 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration Here I was only tracing functions that happen where we are between START_COMMIT and UNBLOCKED in order to see what would be keeping us blocked for so long. The wait_for_commit() we do is where we wait for a previous transaction that hasn't completed it's commit. This can include all of the unpin work and other cleanups, which tends to be the longest part of our transaction commit. There is no reason we should be blocking new things from entering the transaction at this point, it just adds to random latency spikes for no reason. Fix this by adding a PREP stage. This allows us to properly deal with multiple committers coming in at the same time, we retain the behavior that the winner waits on the previous transaction and the losers all wait for this transaction commit to occur. Nothing else is blocked during the PREP stage, and then once the wait is complete we switch to COMMIT_START and all of the same behavior as before is maintained. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: drop redundant check to use fs_devices::metadata_uuidAnand Jain
fs_devices::metadata_uuid value is already updated based on the super_block::METADATA_UUID flag for either fsid or metadata_uuid as appropriate. So, fs_devices::metadata_uuid can be used directly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_superAnand Jain
The function btrfs_validate_super() should verify the metadata_uuid in the provided superblock argument. Because, all its callers expect it to do that. Such as in the following stacks: write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..) scrub_one_super() btrfs_validate_super(.., sb, ..) And check_dev_super() btrfs_validate_super(.., sb, ..) However, it currently verifies the fs_info::super_copy::metadata_uuid instead. Fix this using the correct metadata_uuid in the superblock argument. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use the correct superblock to compare fsid in btrfs_validate_superAnand Jain
The function btrfs_validate_super() should verify the fsid in the provided superblock argument. Because, all its callers expect it to do that. Such as in the following stack: write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..) scrub_one_super() btrfs_validate_super(.., sb, ..) And check_dev_super() btrfs_validate_super(.., sb, ..) However, it currently verifies the fs_info::super_copy::fsid instead, which is not correct. Fix this using the correct fsid in the superblock argument. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use LIST_HEAD() to initialize the list_headRuan Jinjie
Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: reserve zones for an active metadata/system block groupNaohiro Aota
Ensure a metadata and system block group can be activated on write time, by leaving a certain number of active zones when trying to activate a data block group. Zones for two metadata block groups (normal and tree-log) and one system block group are reserved, according to the profile type: two zones per block group on the DUP profile and one zone per block group otherwise. The reservation must be freed once a non-data block group is allocated. If not, we over-reserve the active zones and data block group activation will suffer. For the dynamic reservation count, we need to manage the reservation count per device. The reservation count variable is protected by fs_info->zone_active_bgs_lock. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: make btrfs_cleanup_fs_roots() staticFilipe Manana
btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static, remove its prototype from disk-io.h and move its definition above the where it's used in disk-io.c Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: store the error that turned the fs into error stateFilipe Manana
Currently when we turn the fs into an error state, typically after a transaction abort, we don't store the error anywhere, we just set a bit (BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the error state. There are cases where it would be useful to have access to the specific error in order to provide a more meaningful error to users/applications. This change adds a member to struct btrfs_fs_info to store the error and removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new member (fs_error) has a value of 0, otherwise its value is a negative errno value. Followup changes will make use of this new member. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>