summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-12-06NFSv4.x: Fail client initialisation if state manager thread can't runTrond Myklebust
If the state manager thread fails to start, then we should just mark the client initialisation as failed so that other processes or threads don't get stuck in nfs_wait_client_init_complete(). Reported-by: ChenXiaoSong <chenxiaosong2@huawei.com> Fixes: 4697bd5e9419 ("NFSv4: Fix a race in the net namespace mount notification") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06fs: nfs: sysfs: use sysfs_emit() to instead of scnprintf()ye xingchen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFS: use sysfs_emit() to instead of scnprintf()ye xingchen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFS: Allow very small rsize & wsize againAnna Schumaker
940261a19508 introduced nfs_io_size() to clamp the iosize to a multiple of PAGE_SIZE. This had the unintended side effect of no longer allowing iosizes less than a page, which could be useful in some situations. UDP already has an exception that causes it to fall back on the power-of-two style sizes instead. This patch adds an additional exception for very small iosizes. Reported-by: Jeff Layton <jlayton@kernel.org> Fixes: 940261a19508 ("NFS: Allow setting rsize / wsize to a multiple of PAGE_SIZE") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFSv4.2: Fix up READ_PLUS alignmentAnna Schumaker
Assume that the first segment will be a DATA segment, and place the data directly into the xdr pages so it doesn't need to be shifted. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFSv4.2: Set the correct size scratch buffer for decoding READ_PLUSAnna Schumaker
The scratch_buf array is 16 bytes, but I was passing 32 to the xdr_set_scratch_buffer() function. Fix this by using sizeof(), which is what I probably should have been doing this whole time. Fixes: d3b00a802c84 ("NFS: Replace the READ_PLUS decoding code") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFS: avoid spurious warning of lost lock that is being unlocked.NeilBrown
When the NFSv4 state manager recovers state after a server restart, it reports that locks have been lost if it finds any lock state for which recovery hasn't been successful. i.e. any for which NFS_LOCK_INITIALIZED is not set. However it only tries to recover locks that are still linked to inode->i_flctx. So if a lock has been removed from inode->i_flctx, but the state for that lock has not yet been destroyed, then a spurious warning results. nfs4_proc_unlck() calls locks_lock_inode_wait() - which removes the lock from ->i_flctx - before sending the unlock request to the server and before the final nfs4_put_lock_state() is called. This allows a window in which a spurious warning can be produced. So add a new flag NFS_LOCK_UNLOCKING which is set once the decision has been made to unlock the lock. This will prevent it from triggering any warning. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06nfs: fix possible null-ptr-deref when parsing paramHawkins Jiawei
According to commit "vfs: parse: deal with zero length string value", kernel will set the param->string to null pointer in vfs_parse_fs_string() if fs string has zero length. Yet the problem is that, nfs_fs_context_parse_param() will dereferences the param->string, without checking whether it is a null pointer, which may trigger a null-ptr-deref bug. This patch solves it by adding sanity check on param->string in nfs_fs_context_parse_param(). Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFSv4: check FMODE_EXEC from open context mode in nfs4_opendata_access()ChenXiaoSong
After converting file f_flags to open context mode by flags_to_mode(), open context mode will have FMODE_EXEC when file open for exec, so we check FMODE_EXEC from open context mode. No functional change, just simplify the code. Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06NFS: make sure open context mode have FMODE_EXEC when file open for execChenXiaoSong
Because file f_mode never have FMODE_EXEC, open context mode won't get FMODE_EXEC from file f_mode. Open context mode only care about FMODE_READ/ FMODE_WRITE/FMODE_EXEC, and all info about open context mode can be convert from file f_flags, so convert file f_flags to open context mode by flags_to_mode(). Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06gfs2: Partially revert gfs2_inode_lookup changeAndreas Gruenbacher
Commit c412a97cf6c5 changed delete_work_func() to always perform an inode lookup when gfs2_try_evict() fails. This doesn't make sense as a gfs2_try_evict() failure indicates that the inode is likely still in use. Revert that change. Fixes: c412a97cf6c5 ("gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Add gfs2_inode_lookup commentAndreas Gruenbacher
Add comment on when and why gfs2_cancel_delete_work() needs to be skipped in gfs2_inode_lookup(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Uninline and improve glock_{set,clear}_objectAndreas Gruenbacher
Those functions have reached a size at which having them inline isn't useful anymore, so uninline them. In addition, report the glock name on assertion failures. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Simply dequeue iopen glock in gfs2_evict_inodeAndreas Gruenbacher
With the previous change, to simplify things, we can always just dequeue and uninitialize the iopen glock in gfs2_evict_inode() even if it isn't queued anymore. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Clean up after gfs2_create_inode reworkAndreas Gruenbacher
Since commit 3d36e57ff768 ("gfs2: gfs2_create_inode rework"), gfs2_evict_inode() and gfs2_create_inode() / gfs2_inode_lookup() will synchronize via the inode hash table and we can be certain that once a new inode is inserted into the inode hash table(), gfs2_evict_inode() has completely destroyed any previous versions. We no longer need to worry about overlapping inode object lifespans. Update the code and comments accordingly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Avoid dequeuing GL_ASYNC glock holders twiceAndreas Gruenbacher
When a locking request fails, the associated glock holder is automatically dequeued from the list of active and waiting holders. For GL_ASYNC locking requests, this will obviously happen asynchronously and it can race with attempts to cancel that locking request via gfs2_glock_dq(). Therefore, don't forget to check if a locking request has already been dequeued in gfs2_glock_dq(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Make gfs2_glock_hold return its glock argumentAndreas Gruenbacher
This allows code like 'gl = gfs2_glock_hold(...)'. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Always check inode size of inline inodesAndreas Gruenbacher
Check if the inode size of stuffed (inline) inodes is within the allowed range when reading inodes from disk (gfs2_dinode_in()). This prevents us from on-disk corruption. The two checks in stuffed_readpage() and gfs2_unstuffer_page() that just truncate inline data to the maximum allowed size don't actually make sense, and they can be removed now as well. Reported-by: syzbot+7bb81dfa9cda07d9cd9d@syzkaller.appspotmail.com Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-06gfs2: Cosmetic gfs2_dinode_{in,out} cleanupAndreas Gruenbacher
In each of the two functions, add an inode variable that points to &ip->i_inode and use that throughout the rest of the function. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-05pstore: Avoid kcore oops by vmap()ing with VM_IOREMAPStephen Boyd
An oops can be induced by running 'cat /proc/kcore > /dev/null' on devices using pstore with the ram backend because kmap_atomic() assumes lowmem pages are accessible with __va(). Unable to handle kernel paging request at virtual address ffffff807ff2b000 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x06: level 2 translation fault Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081d87000 [ffffff807ff2b000] pgd=180000017fe18003, p4d=180000017fe18003, pud=180000017fe18003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: dm_integrity CPU: 7 PID: 21179 Comm: perf Not tainted 5.15.67-10882-ge4eb2eb988cd #1 baa443fb8e8477896a370b31a821eb2009f9bfba Hardware name: Google Lazor (rev3 - 8) (DT) pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __memcpy+0x110/0x260 lr : vread+0x194/0x294 sp : ffffffc013ee39d0 x29: ffffffc013ee39f0 x28: 0000000000001000 x27: ffffff807ff2b000 x26: 0000000000001000 x25: ffffffc0085a2000 x24: ffffff802d4b3000 x23: ffffff80f8a60000 x22: ffffff802d4b3000 x21: ffffffc0085a2000 x20: ffffff8080b7bc68 x19: 0000000000001000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: ffffffd3073f2e60 x14: ffffffffad588000 x13: 0000000000000000 x12: 0000000000000001 x11: 00000000000001a2 x10: 00680000fff2bf0b x9 : 03fffffff807ff2b x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffffff802d4b4000 x4 : ffffff807ff2c000 x3 : ffffffc013ee3a78 x2 : 0000000000001000 x1 : ffffff807ff2b000 x0 : ffffff802d4b3000 Call trace: __memcpy+0x110/0x260 read_kcore+0x584/0x778 proc_reg_read+0xb4/0xe4 During early boot, memblock reserves the pages for the ramoops reserved memory node in DT that would otherwise be part of the direct lowmem mapping. Pstore's ram backend reuses those reserved pages to change the memory type (writeback or non-cached) by passing the pages to vmap() (see pfn_to_page() usage in persistent_ram_vmap() for more details) with specific flags. When read_kcore() starts iterating over the vmalloc region, it runs over the virtual address that vmap() returned for ramoops. In aligned_vread() the virtual address is passed to vmalloc_to_page() which returns the page struct for the reserved lowmem area. That lowmem page is passed to kmap_atomic(), which effectively calls page_to_virt() that assumes a lowmem page struct must be directly accessible with __va() and friends. These pages are mapped via vmap() though, and the lowmem mapping was never made, so accessing them via the lowmem virtual address oopses like above. Let's side-step this problem by passing VM_IOREMAP to vmap(). This will tell vread() to not include the ramoops region in the kcore. Instead the area will look like a bunch of zeros. The alternative is to teach kmap() about vmalloc areas that intersect with lowmem. Presumably such a change isn't a one-liner, and there isn't much interest in inspecting the ramoops region in kcore files anyway, so the most expedient route is taken for now. Cc: Brian Geffon <bgeffon@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 404a6043385d ("staging: android: persistent_ram: handle reserving and mapping memory") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221205233136.3420802-1-swboyd@chromium.org
2022-12-05gfs2: Handle -EBUSY result of insert_inode_locked4Andreas Gruenbacher
When creating a new inode, there is a small chance that an inode lookup for a previous version of the same inode is still in progress. In that case, that previous lookup will eventually fail, but we may still need to retry here. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-05NFS4.x/pnfs: Fix up logging of layout stateidsTrond Myklebust
If the layout is invalid, then just log a '0' value. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-05btrfs: print transaction aborted messages with an error levelFilipe Manana
Currently we print the transaction aborted message with a debug level, but a transaction abort is an exceptional event that indicates something went wrong and it's useful to have it printed with an error level as it helps analysing problems in a production environment, where debug level messages are typically not logged. For example reports from syzbot never include the transaction aborted message, since the log level on the test machines is above the debug level. So change the log level from debug to error. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a rangeFilipe Manana
If we get -ENOMEM while dropping file extent items in a given range, at btrfs_drop_extents(), due to failure to allocate memory when attempting to increment the reference count for an extent or drop the reference count, we handle it with a BUG_ON(). This is excessive, instead we can simply abort the transaction and return the error to the caller. In fact most callers of btrfs_drop_extents(), directly or indirectly, already abort the transaction if btrfs_drop_extents() returns any error. Also, we already have error paths at btrfs_drop_extents() that may return -ENOMEM and in those cases we abort the transaction, like for example anything that changes the b+tree may return -ENOMEM due to a failure to allocate a new extent buffer when COWing an existing extent buffer, such as a call to btrfs_duplicate_item() for example. So replace the BUG_ON() calls with proper logic to abort the transaction and return the error. Reported-by: syzbot+0b1fb6b0108c27419f9f@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000089773e05ee4b9cb4@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix extent map use-after-free when handling missing device in ↵void0red
read_one_chunk Store the error code before freeing the extent_map. Though it's reference counted structure, in that function it's the first and last allocation so this would lead to a potential use-after-free. The error can happen eg. when chunk is stored on a missing device and the degraded mount option is missing. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721 Reported-by: eriri <1527030098@qq.com> Fixes: adfb69af7d8c ("btrfs: add_missing_dev() should return the actual error") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: void0red <void0red@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove outdated logic from overwrite_item() and add assertionFilipe Manana
As of commit 193df6245704 ("btrfs: search for last logged dir index if it's not cached in the inode"), the overwrite_item() function is always called for a root that is from a fs/subvolume tree. In other words, now it's only used during log replay to modify a fs/subvolume tree. Therefore we can remove the logic that checks if we are dealing with a log tree at overwrite_item(). So remove that logic, replacing it with an assertion and document that if we ever need to support a log root there, we will need to clone the leaf from the fs/subvolume tree and then release it before modifying the log tree, which is needed to avoid a potential deadlock, similar to the one recently fixed by a patch with the subject: "btrfs: do not modify log tree while holding a leaf from fs tree locked" Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: unify overwrite_item() and do_overwrite_item()Filipe Manana
After commit 193df6245704 ("btrfs: search for last logged dir index if it's not cached in the inode"), there are no more callers of do_overwrite_item(), except overwrite_item(). Originally both used to be the same function, but were split in commit 086dcbfa50d3 ("btrfs: insert items in batches when logging a directory when possible"), as there was the need to execute all logic of overwrite_item() but skip the tree search, since in the context of directory logging we already had a path with a leaf to copy data from. So unify them again as there is no more need to have them split. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: replace strncpy() with strscpy()Artem Chernyshev
Using strncpy() on NUL-terminated strings are deprecated. To avoid possible forming of non-terminated string strscpy() should be used. Found by Linux Verification Center (linuxtesting.org) with SVACE. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix uninitialized variable in find_first_clear_extent_bitJosef Bacik
This was caught when syncing extent-io-tree.c into btrfs-progs. This however isn't really a problem, the only way next would be uninitialized is if we found the range we were looking for, and in this case we don't care about next. However it's a compile error, so fix it up. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: fix uninitialized parent in insert_stateJosef Bacik
I don't know how this isn't caught when we build this in the kernel, but while syncing extent-io-tree.c into btrfs-progs I got an error because parent could potentially be uninitialized when we link in a new node, specifically when the extent_io_tree is empty. This means we could have garbage in the parent color. I don't know what the ramifications are of that, but it's probably not great, so fix this by initializing parent to NULL. I spot checked all of our other usages in btrfs and we appear to be doing the correct thing everywhere else. Fixes: c7e118cf98c7 ("btrfs: open code rbtree search in insert_state") CC: stable@vger.kernel.org # 6.0+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add might_sleep() annotationsChenXiaoSong
Add annotations to functions that might sleep due to allocations or IO and could be called from various contexts. In case of btrfs_search_slot it's not obvious why it would sleep: btrfs_search_slot setup_nodes_for_search reada_for_balance btrfs_readahead_node_child btrfs_readahead_tree_block btrfs_find_create_tree_block alloc_extent_buffer kmem_cache_zalloc /* allocate memory non-atomically, might sleep */ kmem_cache_alloc(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO) read_extent_buffer_pages submit_extent_page /* disk IO, might sleep */ submit_one_bio Other examples where the sleeping could happen is in 3 places might sleep in update_qgroup_limit_item(), as shown below: update_qgroup_limit_item btrfs_alloc_path /* allocate memory non-atomically, might sleep */ kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS) Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add stack helpers for a few btrfs itemsJosef Bacik
We don't have these defined in the kernel because we don't have any users of these helpers. However we do use them in btrfs-progs, so define them to make keeping accessors.h in sync between progs and the kernel easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add nr_global_roots to the super block definitionJosef Bacik
We already have this defined in btrfs-progs, add it to the kernel to make it easier to sync these files into btrfs-progs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove BTRFS_LEAF_DATA_OFFSETJosef Bacik
This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so remove this helper and replace it's usage with the above statement. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add helpers for manipulating leaf items and dataJosef Bacik
We have some gnarly memmove and copy_extent_buffer calls for leaf manipulation. This is because our item offsets aren't absolute, they're based on 0 being where the items start in the leaf, which is after the btrfs_header. This means any manipulation of the data requires adding sizeof(struct btrfs_header) to the offsets we pull from the items. Moving the items themselves is easier as the helpers are absolute offsets, however we of course have to call the helpers to get the offsets for the item numbers. This makes for copy_extent_buffer/memmove_extent_buffer calls that are kind of hard to reason about what's happening. Fix this by pushing this logic into helpers. For data we'll only use the item provided offsets, and the helpers will use the BTRFS_LEAF_DATA_OFFSET addition for the offsets. Additionally for the item manipulation simply pass in the item numbers, and then the helpers will call the offset helper to get the actual offset into the leaf. The diffstat makes this look like more code, but that's simply because I added comments for the helpers, it's net negative for the amount of code, and is easier to reason. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add eb to btrfs_node_key_ptr_offsetJosef Bacik
This is a change needed for extent tree v2, as we will be growing the header size. This exists in btrfs-progs currently, and not having it makes syncing accessors.[ch] more problematic. So make this change to set us up for extent tree v2 and match what btrfs-progs does to make syncing easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass the extent buffer for the btrfs_item_nr helpersJosef Bacik
This is actually a change for extent tree v2, but it exists in btrfs-progs but not in the kernel. This makes it annoying to sync accessors.h with btrfs-progs, and since this is the way I need it for extent-tree v2 simply update these helpers to take the extent buffer in order to make syncing possible now, and make the extent tree v2 stuff easier moving forward. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the csum helpers into ctree.hJosef Bacik
These got moved because of copy+paste, but this code exists in ctree.c, so move the declarations back into ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move eb offset helpers into extent_io.hJosef Bacik
These are very specific to how the extent buffer is defined, so this differs between btrfs-progs and the kernel. Make things easier by moving these helpers into extent_io.h so we don't have to worry about this when syncing ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file_extent_item helpers into file-item.hJosef Bacik
These helpers use functions that are in multiple places, which makes it tricky to sync them into btrfs-progs. Move them to file-item.h and then include file-item.h in places that use these helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move leaf_data_end into ctree.cJosef Bacik
This is only used in ctree.c, with the exception of zero'ing out extent buffers we're getting ready to write out. In theory we shouldn't have an extent buffer with 0 items that we're writing out, however I'd rather be safe than sorry so open code it in extent_io.c, and then copy the helper into ctree.c. This will make it easier to sync accessors.[ch] into btrfs-progs, as this requires a helper that isn't defined in accessors.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move root helpers back into ctree.hJosef Bacik
These accidentally got brought into accessors.h, but belong with the btrfs_root definitions which are currently in ctree.h. Move these to make it easier to sync accessors.[ch] into btrfs-progs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move repair_io_failure to bio.cChristoph Hellwig
repair_io_failure ties directly into all the glory low-level details of mapping a bio with a logic address to the actual physical location. Move it right below btrfs_submit_bio to keep all the related logic together. Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that repair_io_failure is available in a header. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: split the bio submission path into a separate fileChristoph Hellwig
The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move struct btrfs_tree_parent_check out of disk-io.hChristoph Hellwig
Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h an various .c files don't have to include disk-io.h just for it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ use tree-checker.h for the structure ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: do data csum verification during RMW cycleQu Wenruo
[BUG] For the following small script, btrfs will be unable to recover the content of file1: mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1 md5sum $mnt/file1 umount $mnt # Corrupt the above 64K data stripe. xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3 mount $dev1 $mnt # Write a new 64K, which should be in the other data stripe # And this is a sub-stripe write, which will cause RMW xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2 md5sum $mnt/file1 umount $mnt Above md5sum would fail. [CAUSE] There is a long existing problem for raid56 (not limited to btrfs raid56) that, if we already have some corrupted on-disk data, and then trigger a sub-stripe write (which needs RMW cycle), it can cause further damage into P/Q stripe. Disk 1: data 1 |0x000000000000| <- Corrupted Disk 2: data 2 |0x000000000000| Disk 2: parity |0xffffffffffff| In above case, data 1 is already corrupted, the original data should be 64KiB of 0xff. At this stage, if we read data 1, and it has data checksum, we can still recovery going via the regular RAID56 recovery path. But if now we decide to write some data into data 2, then we need to go RMW. Let's say we want to write 64KiB of '0x00' into data 2, then we read the on-disk data of data 1, calculate the new parity, resulting the following layout: Disk 1: data 1 |0x000000000000| <- Corrupted Disk 2: data 2 |0x000000000000| <- New '0x00' writes Disk 2: parity |0x000000000000| <- New Parity. But the new parity is calculated using the *corrupted* data 1, we can no longer recover the correct data of data1. Thus the corruption is forever there. [FIX] To solve above problem, this patch will do a full stripe data checksum verification at RMW time. This involves the following changes: - Always read the full stripe (including data/P/Q) when doing RMW Before we only read the missing data sectors, but since we may do a data csum verification and recovery, we need to read everything out. Please note that, if we have a cached rbio, we don't need to read anything, and can treat it the same as full stripe write. As only stripe with all its csum matches can be cached. - Verify the data csum during read. The goal is only the rbio stripe sectors, and only if the rbio already has csum_buf/csum_bitmap filled. And sectors which cannot pass csum verification will have their bit set in error_bitmap. - Always call recovery_sectors() after we read out all the sectors Since error_bitmap will be updated during read, recover_sectors() can easily find out all the bad sectors and try to recover (if still under tolerance). And since recovery_sectors() is already migrated to use error_bitmap, it can skip vertical stripes which don't have any error. - Verify the repaired sectors against its csum in recover_vertical() - Rename rmw_read_and_wait() to rmw_read_wait_recover() Since we will always recover the sectors, the old name is no longer accurate. Furthermore since recovery is already done in rmw_read_wait_recover(), we no longer need to call recovery_sectors() inside rmw_rbio(). Obviously this will have a performance impact, as we are doing more work during RMW cycle: - Fetch the data checksums - Do checksum verification for all data stripes - Do checksum verification again after repair But for full stripe write or cached rbio we won't have the overhead all, thus for fully optimized RAID56 workload (always full stripe write), there should be no extra overhead. To me, the extra overhead looks reasonable, as data consistency is way more important than performance. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: raid56: prepare data checksums for later RMW verificationQu Wenruo
This is for later data checksum verification at RMW time. This patch will try to allocate the needed memory for a locked rbio if the rbio is for data exclusively (we don't want to handle mixed bg yet). The memory will be released when the rbio is finished. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: introduce a bitmap based csum range search functionQu Wenruo
Although we have an existing function, btrfs_lookup_csums_range(), to find all data checksums for a range, it's based on a btrfs_ordered_sum list. For the incoming RAID56 data checksum verification at RMW time, we don't want to waste time by allocating temporary memory. So this patch will introduce a new helper, btrfs_lookup_csums_bitmap(). It will use bitmap based result, which will be a perfect fit for later RAID56 usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: refactor checksum calculations in btrfs_lookup_csums_range()Qu Wenruo
The refactoring involves the following parts: - Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers As we have quite some open-coded calculations, some of them are even split into two assignments just to fit 80 chars limit. - Remove the @csum_size parameter from max_ordered_sum_bytes() Csum size can be fetched from @fs_info. And we will use the csum_size_to_bytes() helper anyway. - Add a comment explaining how we handle the first search result - Use newly introduced helpers to cleanup btrfs_lookup_csums_range() - Move variables declaration to the minimal scope - Never mix number of sectors with bytes There are several locations doing things like: size = min_t(size_t, csum_end - start, max_ordered_sum_bytes(fs_info)); ... size >>= fs_info->sectorsize_bits Or offset = (start - key.offset) >> fs_info->sectorsize_bits; offset *= csum_size; Make sure these variables can only represent BYTES inside the function, by using the above bytes_to_csum_size() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: allocate btrfs_io_context without GFP_NOFAILLi zeming
The __GFP_NOFAIL flag could loop indefinitely when allocation memory in alloc_btrfs_io_context. The callers starting from __btrfs_map_block already handle errors so it's safe to drop the flag. Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>