summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-01-14btrfs: selftests: add selftest for punching holes into the RAID stripe extentsJohannes Thumshirn
Add a selftest for punching a hole into a RAID stripe extent. The test create an 1M extent and punches a 64k bytes long hole at offset of 32k from the start of the extent. Afterwards it verifies the start and length of both resulting new extents "left" and "right" as well as the absence of the hole. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: selftests: test RAID stripe-tree deletion spanning two itemsJohannes Thumshirn
Add a selftest for RAID stripe-tree deletion with a delete range spanning two items, so that we're punching a hole into two adjacent RAID stripe extents truncating the first and "moving" the second to the right. The following diagram illustrates the operation: |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---| |----- keep -----|--- drop ---|----- keep ----| Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: selftests: don't split RAID extents in halfJohannes Thumshirn
The selftests for partially deleting the start or tail of RAID stripe-extents split these extents in half. This can hide errors in the calculation, so don't split the RAID stripe-extents in half but delete the first or last 16K of the 64K extents. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: selftests: check for correct return value of failed lookupJohannes Thumshirn
Commit 5e72aabc1fff ("btrfs: return ENODATA in case RST lookup fails") changed btrfs_get_raid_extent_offset()'s return value to ENODATA in case the RAID stripe-tree lookup failed. Adjust the test cases which check for absence of a given range to check for ENODATA as return value in this case. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extentsJohannes Thumshirn
Don't use btrfs_set_item_key_safe() to modify the keys in the RAID stripe-tree, as this can lead to corruption of the tree, which is caught by the checks in btrfs_set_item_key_safe(): BTRFS info (device nvme1n1): leaf 49168384 gen 15 total ptrs 194 free space 8329 owner 12 BTRFS info (device nvme1n1): refs 2 lock_owner 1030 current 1030 [ snip ] item 105 key (354549760 230 20480) itemoff 14587 itemsize 16 stride 0 devid 5 physical 67502080 item 106 key (354631680 230 4096) itemoff 14571 itemsize 16 stride 0 devid 1 physical 88559616 item 107 key (354631680 230 32768) itemoff 14555 itemsize 16 stride 0 devid 1 physical 88555520 item 108 key (354717696 230 28672) itemoff 14539 itemsize 16 stride 0 devid 2 physical 67604480 [ snip ] BTRFS critical (device nvme1n1): slot 106 key (354631680 230 32768) new key (354635776 230 4096) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2602! Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 1055 Comm: fsstress Not tainted 6.13.0-rc1+ #1464 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0xf7/0x270 Code: <snip> RSP: 0018:ffffc90001337ab0 EFLAGS: 00010287 RAX: 0000000000000000 RBX: ffff8881115fd000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000ffffffff RBP: ffff888110ed6f50 R08: 00000000ffffefff R09: ffffffff8244c500 R10: 00000000ffffefff R11: 00000000ffffffff R12: ffff888100586000 R13: 00000000000000c9 R14: ffffc90001337b1f R15: ffff888110f23b58 FS: 00007f7d75c72740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa811652c60 CR3: 0000000111398001 CR4: 0000000000370eb0 Call Trace: <TASK> ? __die_body.cold+0x14/0x1a ? die+0x2e/0x50 ? do_trap+0xca/0x110 ? do_error_trap+0x65/0x80 ? btrfs_set_item_key_safe+0xf7/0x270 ? exc_invalid_op+0x50/0x70 ? btrfs_set_item_key_safe+0xf7/0x270 ? asm_exc_invalid_op+0x1a/0x20 ? btrfs_set_item_key_safe+0xf7/0x270 btrfs_partially_delete_raid_extent+0xc4/0xe0 btrfs_delete_raid_extent+0x227/0x240 __btrfs_free_extent.isra.0+0x57f/0x9c0 ? exc_coproc_segment_overrun+0x40/0x40 __btrfs_run_delayed_refs+0x2fa/0xe80 btrfs_run_delayed_refs+0x81/0xe0 btrfs_commit_transaction+0x2dd/0xbe0 ? preempt_count_add+0x52/0xb0 btrfs_sync_file+0x375/0x4c0 do_fsync+0x39/0x70 __x64_sys_fsync+0x13/0x20 do_syscall_64+0x54/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f7d7550ef90 Code: <snip> RSP: 002b:00007ffd70237248 EFLAGS: 00000202 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f7d7550ef90 RDX: 000000000000013a RSI: 000000000040eb28 RDI: 0000000000000004 RBP: 000000000000001b R08: 0000000000000078 R09: 00007ffd7023725c R10: 00007f7d75400390 R11: 0000000000000202 R12: 028f5c28f5c28f5c R13: 8f5c28f5c28f5c29 R14: 000000000040b520 R15: 00007f7d75c726c8 </TASK> While the root cause of the tree order corruption isn't clear, using btrfs_duplicate_item() to copy the item and then adjusting both the key and the per-device physical addresses is a safe way to counter this problem. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: implement hole punching for RAID stripe extentsJohannes Thumshirn
If the stripe extent we want to delete starts before the range we want to delete and ends after the range we want to delete we're punching a hole in the stripe extent: |--- RAID Stripe Extent ---| | keep |--- drop ---| keep | This means we need to a) truncate the existing item and b) create a second item for the remaining range. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: fix deletion of a range spanning parts two RAID stripe extentsJohannes Thumshirn
When a user requests the deletion of a range that spans multiple stripe extents and btrfs_search_slot() returns us the second RAID stripe extent, we need to pick the previous item and truncate it, if there's still a range to delete left, move on to the next item. The following diagram illustrates the operation: |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---| |--- keep ---|--- drop ---| While at it, comment the trivial case of a whole item delete as well. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: fix tail delete of RAID stripe-extentsJohannes Thumshirn
Fix tail delete of RAID stripe-extents, if there is a range to be deleted as well after the tail delete of the extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: fix front delete range calculation for RAID stripe extentsJohannes Thumshirn
When deleting the front of a RAID stripe-extent the delete code miscalculates the size on how much to pad the remaining extent part in the front. Fix the calculation so we're always having the sizes we expect. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: assert RAID stripe-extent length is always greater than 0Johannes Thumshirn
When modifying a RAID stripe-extent, ASSERT() that the length of the new RAID stripe-extent is always greater than 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: don't try to delete RAID stripe-extents if we don't need toJohannes Thumshirn
Even if the RAID stripe-tree is not enabled in the filesystem, do_free_extent_accounting() still calls into btrfs_delete_raid_extent(). Check if the extent in question is on a block-group that has a profile which is used by RAID stripe-tree before attempting to delete a stripe extent. Return early if it doesn't, otherwise we're doing a unnecessary search. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14btrfs: selftests: correct RAID stripe-tree feature flag settingJohannes Thumshirn
RAID stripe-tree is an incompatible feature not a read-only compatible, so set the incompat flag not a compat_ro one in the selftest code. Subsequent changes in btrfs_delete_raid_extent() will start checking for this flag. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14xfs: add a b_iodone callback to struct xfs_bufChristoph Hellwig
Stop open coding the log item completions and instead add a callback into back into the submitter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move b_li_list based retry handling to common codeChristoph Hellwig
The dquot and inode version are very similar, which is expected given the overall b_li_list logic. The differences are that the inode version also clears the XFS_LI_FLUSHING which is defined in common but only ever set by the inode item, and that the dquot version takes the ail_lock over the list iteration. While this seems sensible given that additions and removals from b_li_list are protected by the ail_lock, log items are only added before buffer submission, and are only removed when completing the buffer, so nothing can change the list when retrying a buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify xfsaild_resubmit_itemChristoph Hellwig
Since commit acc8f8628c37 ("xfs: attach dquot buffer to dquot log item buffer") all buf items that use bp->b_li_list are explicitly checked for in the branch to just clears XFS_LI_FAILED. Remove the dead arm that calls xfs_clear_li_failed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: always complete the buffer inline in xfs_buf_submitChristoph Hellwig
xfs_buf_submit now only completes a buffer on error, or for in-memory buftargs. There is no point in using a workqueue for the latter as the completion will just wake up the caller. Optimize this case by avoiding the workqueue roundtrip. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the extra buffer reference in xfs_buf_submitChristoph Hellwig
Nothing touches the buffer after it has been submitted now, so the need for the extra transient reference went away as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move invalidate_kernel_vmap_range to xfs_buf_ioendChristoph Hellwig
Invalidating cache lines can be fairly expensive, so don't do it in interrupt context. Note that in practice very few setup will actually do anything here as virtually indexed caches are rather uncommon, but we might as well move the call to the proper place while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify buffer I/O submissionChristoph Hellwig
The code in _xfs_buf_ioapply is unnecessarily complicated because it doesn't take advantage of modern bio features. Simplify it by making use of bio splitting and chaining, that is build a single bio for the pages in the buffer using a simple loop, and then split that bio on the map boundaries for discontiguous multi-FSB buffers and chain the split bios to the main one so that there is only a single I/O completion. This not only simplifies the code to build the buffer, but also removes the need for the b_io_remaining field as buffer ownership is granted to the bio on submit of the final bio with no chance for a completion before that as well as the b_io_error field that is now superfluous because there always is exactly one completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move in-memory buftarg handling out of _xfs_buf_ioapplyChristoph Hellwig
No I/O to apply for in-memory buffers, so skip the function call entirely. Clean up the b_io_error initialization logic to allow for this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move write verification out of _xfs_buf_ioapplyChristoph Hellwig
Split the write verification logic out of _xfs_buf_ioapply into a new xfs_buf_verify_write helper called by xfs_buf_submit given that it isn't about applying the I/O and doesn't really fit in with the rest of _xfs_buf_ioapply. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove xfs_buf_delwri_submit_buffersChristoph Hellwig
xfs_buf_delwri_submit_buffers has two callers for synchronous and asynchronous writes that share very little logic. Split out a helper for the shared per-buffer loop and otherwise open code the submission in the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify xfs_buf_delwri_pushbufChristoph Hellwig
xfs_buf_delwri_pushbuf synchronously writes a buffer that is on a delwri list already. Instead of doing a complicated dance with the delwri and wait list, just leave them alone and open code the actual buffer write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move xfs_buf_iowait out of (__)xfs_buf_submitChristoph Hellwig
There is no good reason to pass a bool argument to wait for a buffer when the callers that want that can easily just wait themselves. This means the wait moves out of the extra hold of the buffer, but as the callers of synchronous buffer I/O need to hold a reference anyway that is perfectly fine. Because all async buffer submitters ignore the error return value, and the synchronous ones catch the error condition through b_error and xfs_buf_iowait this also means the new xfs_buf_submit doesn't have to return an error code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the incorrect comment about the b_pag fieldChristoph Hellwig
The rbtree root is long gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the incorrect comment above xfs_buf_free_mapsChristoph Hellwig
The comment above xfs_buf_free_maps talks about fields not even used in the function and also doesn't add any other value. Remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: fix a double completion for buffers on in-memory targetsChristoph Hellwig
__xfs_buf_submit calls xfs_buf_ioend when b_io_remaining hits zero. For in-memory buftargs b_io_remaining is never incremented from it's initial value of 1, so this always happens. Thus the extra call to xfs_buf_ioend in _xfs_buf_ioapply causes a double completion. Fortunately __xfs_buf_submit is only used for synchronous reads on in-memory buftargs due to the peculiarities of how they work, so this is mostly harmless and just causes a little extra work to be done. Fixes: 5076a6040ca1 ("xfs: support in-memory buffer cache targets") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-13mm: perform all memfd seal checks in a single placeLorenzo Stoakes
We no longer actually need to perform these checks in the f_op->mmap() hook any longer. We already moved the operation which clears VM_MAYWRITE on a read-only mapping of a write-sealed memfd in order to work around the restrictions imposed by commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour"). There is no reason for us not to simply go ahead and additionally check to see if any pre-existing seals are in place here rather than defer this to the f_op->mmap() hook. By doing this we remove more logic from shmem_mmap() which doesn't belong there, as well as doing the same for hugetlbfs_file_mmap(). We also remove dubious shared logic in mm.h which simply does not belong there either. It makes sense to do these checks at the earliest opportunity, we know these are shmem (or hugetlbfs) mappings whose relevant VMA flags will not change from the invoking do_mmap() so there is simply no need to wait. This also means the implementation of further memfd seal flags can be done within mm/memfd.c and also have the opportunity to modify VMA flags as necessary early in the mapping logic. [lorenzo.stoakes@oracle.com: fix typos in !memfd inline stub] Link: https://lkml.kernel.org/r/7dee6c5d-480b-4c24-b98e-6fa47dbd8a23@lucifer.local Link: https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jeff Xu <jeffxu@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: abstract get_arg_page() stack expansion and mmap read lockLorenzo Stoakes
Right now fs/exec.c invokes expand_downwards(), an otherwise internal implementation detail of the VMA logic in order to ensure that an arg page can be obtained by get_user_pages_remote(). In order to be able to move the stack expansion logic into mm/vma.c to make it available to userland testing we need to find an alternative approach here. We do so by providing the mmap_read_lock_maybe_expand() function which also helpfully documents what get_arg_page() is doing here and adds an additional check against VM_GROWSDOWN to make explicit that the stack expansion logic is only invoked when the VMA is indeed a downward-growing stack. This allows expand_downwards() to become a static function. Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be currently user-visible in any way, that is place within an rmap or VMA tree. It must be a newly allocated VMA. This is the case when exec invokes this function. Link: https://lkml.kernel.org/r/5295d1c70c58e6aa63d14be68d4e1de9fa1c8e6d.1733248985.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13btrfs: add the missing error handling inside get_canonical_dev_pathQu Wenruo
Inside function get_canonical_dev_path(), we call d_path() to get the final device path. But d_path() can return error, and in that case the next strscpy() call will trigger an invalid memory access. Add back the missing error handling for d_path(). Reported-by: Boris Burkov <boris@bur.io> Fixes: 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add io_uring interface for encoded writesMark Harmstone
Add an io_uring interface for encoded writes, with the same parameters as the BTRFS_IOC_ENCODED_WRITE ioctl. As with the encoded reads code, there's a test program for this at https://github.com/maharmstone/io_uring-encoded, and I'll get this worked into an fstest. How io_uring works is that it initially calls btrfs_uring_cmd with the IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in a kthread with the flag cleared. Ideally we'd honour this and call try_lock etc., but there's still a lot of work to be done to create non-blocking versions of all the functions in our write path. Instead, just validate the input in btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a view to properly optimizing the optimistic path later on. Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13bcachefs: bcachefs_metadata_version_directory_sizeHongbo Li
This adds another metadata version for accounting directory size. For the new version of the filesystem, when new subdirectory items are created or deleted, the parent directory's size will change accordingly. For the old version of the existed file system, running fsck will automatically upgrade the metadata version, and it will do the check and recalculationg of the directory size. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-13bcachefs: make directory i_size meaningfulHongbo Li
The isize of directory is 0 in bcachefs if the directory is empty. With more child dirents created, its size ought to change. Many other filesystems changed as that (ie. xfs and btrfs). And many of them changed as the size of child dirent name. Although the directory size may not seem to convey much, we can still give it some meaning. The formula of dentry size as follow: occupied_size = 40 + ALIGN(9 + namelen, 8) Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-13f2fs: fix using wrong 'submitted' value in f2fs_write_cache_pageszangyangyang1
When f2fs_write_single_data_page fails, f2fs_write_cache_pages will use the last 'submitted' value incorrectly, which will cause 'nwritten' and 'wbc->nr_to_write' calculation errors Signed-off-by: zangyangyang1 <zangyangyang1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-01-13f2fs: add parameter @len to f2fs_invalidate_blocks()Yi Sun
New function can process some consecutive blocks at a time. Function f2fs_invalidate_blocks()->down_write() and up_write() are very time-consuming, so if f2fs_invalidate_blocks() can process consecutive blocks at one time, it will save a lot of time. Signed-off-by: Yi Sun <yi.sun@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-01-13NFS: Fix potential buffer overflowin nfs_sysfs_link_rpc_client()Zichen Xie
name is char[64] where the size of clnt->cl_program->name remains unknown. Invoking strcat() directly will also lead to potential buffer overflow. Change them to strscpy() and strncat() to fix potential issues. Signed-off-by: Zichen Xie <zichenxie0106@gmail.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-13Merge tag 'mm-hotfixes-stable-2025-01-13-00-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 11 are cc:stable. 13 are MM and 5 are non-MM. All patches are singletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-13-00-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: fs/proc: fix softlockup in __read_vmcore (part 2) mm: fix assertion in folio_end_read() mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled. vmstat: disable vmstat_work on vmstat_cpu_down_prep() zram: fix potential UAF of zram table selftests/mm: set allocated memory to non-zero content in cow test mm: clear uffd-wp PTE/PMD state on mremap() module: fix writing of livepatch relocations in ROX text mm: zswap: properly synchronize freeing resources during CPU hotunplug Revert "mm: zswap: fix race between [de]compression and CPU hotunplug" hugetlb: fix NULL pointer dereference in trace_hugetlbfs_alloc_inode mm: fix div by zero in bdi_ratio_from_pages x86/execmem: fix ROX cache usage in Xen PV guests filemap: avoid truncating 64-bit offset to 32 bits tools: fix atomic_set() definition to set the value correctly mm/mempolicy: count MPOL_WEIGHTED_INTERLEAVE to "interleave_hit" scripts/decode_stacktrace.sh: fix decoding of lines with an additional info mm/kmemleak: fix percpu memory leak detection failure
2025-01-13select: Fix unbalanced user_access_end()Christophe Leroy
While working on implementing user access validation on powerpc I got the following warnings on a pmac32_defconfig build: CC fs/select.o fs/select.o: warning: objtool: sys_pselect6+0x1bc: redundant UACCESS disable fs/select.o: warning: objtool: sys_pselect6_time32+0x1bc: redundant UACCESS disable On powerpc/32s, user_read_access_begin/end() are no-ops, but the failure path has a user_access_end() instead of user_read_access_end() which means an access end without any prior access begin. Replace that user_access_end() by user_read_access_end(). Fixes: 7e71609f64ec ("pselect6() and friends: take handling the combined 6th/7th args into helper") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/r/a7139e28d767a13e667ee3c79599a8047222ef36.1736751221.git.christophe.leroy@csgroup.eu Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-13btrfs: remove the unused locked_folio parameter from ↵Qu Wenruo
btrfs_cleanup_ordered_extents() The function btrfs_cleanup_ordered_extents() is only called in error handling path, and the last caller with a @locked_folio parameter was removed to fix a bug in the btrfs_run_delalloc_range() error handling. There is no need to pass @locked_folio parameter anymore. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add extra error messages for delalloc range related errorsQu Wenruo
All the error handling bugs I hit so far are all -ENOSPC from either: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() Previously when those functions failed, there was no error message at all, making the debugging much harder. So here we introduce extra error messages for: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() - writepage_delalloc() when btrfs_run_delalloc_range() failed - extent_writepage() when extent_writepage_io() failed One example of the new debug error messages is the following one: run fstests generic/750 at 2024-12-08 12:41:41 BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600) BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3 BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536 BTRFS info (device dm-4): using free-space-tree BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-4): checking UUID tree BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28 Which shows an error from cow_file_range() which is called inside a nocow write attempt, along with the extra bitmap from writepage_delalloc(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: subpage: dump the involved bitmap when ASSERT() failedQu Wenruo
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call bitmap_test_range_all_zero() to ensure the involved range has no dirty/lock bit already set. However with my recent enhanced delalloc range error handling, I was hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out that some error handling path is not properly updating the folio flags. So add some extra dumping for the ASSERTs to dump the involved bitmap to help debug. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: subpage: fix the bitmap dump of the locked flagsQu Wenruo
We're dumping the locked bitmap into the @checked_bitmap variable, printing incorrect values during debug. Thankfully even during my development I haven't hit a case where I need to dump the locked bitmap. But for the sake of consistency, fix it by dupping the locked bitmap into @locked_bitmap variable for output. Fixes: 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for debug") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: do proper folio cleanup when run_delalloc_nocow() failedQu Wenruo
[BUG] With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash with the following VM_BUG_ON_FOLIO(): BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774" flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff) page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2992! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : folio_clear_dirty_for_io+0x128/0x258 lr : folio_clear_dirty_for_io+0x128/0x258 Call trace: folio_clear_dirty_for_io+0x128/0x258 btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs] __process_folios_contig+0x154/0x268 [btrfs] extent_clear_unlock_delalloc+0x5c/0x80 [btrfs] run_delalloc_nocow+0x5f8/0x760 [btrfs] btrfs_run_delalloc_range+0xa8/0x220 [btrfs] writepage_delalloc+0x230/0x4c8 [btrfs] extent_writepage+0xb8/0x358 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x178/0x3a8 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] The first two lines of extra debug messages show the problem is caused by the error handling of run_delalloc_nocow(). E.g. we have the following dirtied range (4K blocksize 4K page size): 0 16K 32K |//////////////////////////////////////| | Pre-allocated | And the range [0, 16K) has a preallocated extent. - Enter run_delalloc_nocow() for range [0, 16K) Which found range [0, 16K) is preallocated, can do the proper NOCOW write. - Enter fallback_to_fow() for range [16K, 32K) Since the range [16K, 32K) is not backed by preallocated extent, we have to go COW. - cow_file_range() failed for range [16K, 32K) So cow_file_range() will do the clean up by clearing folio dirty, unlock the folios. Now the folios in range [16K, 32K) is unlocked. - Enter extent_clear_unlock_delalloc() from run_delalloc_nocow() Which is called with PAGE_START_WRITEBACK to start page writeback. But folios can only be marked writeback when it's properly locked, thus this triggered the VM_BUG_ON_FOLIO(). Furthermore there is another hidden but common bug that run_delalloc_nocow() is not clearing the folio dirty flags in its error handling path. This is the common bug shared between run_delalloc_nocow() and cow_file_range(). [FIX] - Clear folio dirty for range [@start, @cur_offset) Introduce a helper, cleanup_dirty_folios(), which will find and lock the folio in the range, clear the dirty flag and start/end the writeback, with the extra handling for the @locked_folio. - Introduce a helper to clear folio dirty, start and end writeback - Introduce a helper to record the last failed COW range end This is to trace which range we should skip, to avoid double unlocking. - Skip the failed COW range for the error handling CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: do proper folio cleanup when cow_file_range() failedQu Wenruo
[BUG] When testing with COW fixup marked as BUG_ON() (this is involved with the new pin_user_pages*() change, which should not result new out-of-band dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW fixup path. This BUG_ON() happens just after a failed btrfs_run_delalloc_range(): BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1444! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : extent_writepage_io+0x2d4/0x308 [btrfs] lr : extent_writepage_io+0x2d4/0x308 [btrfs] Call trace: extent_writepage_io+0x2d4/0x308 [btrfs] extent_writepage+0x218/0x330 [btrfs] extent_write_cache_pages+0x1d4/0x4b0 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x180/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] That failure is mostly from cow_file_range(), where we can hit -ENOSPC. Although the -ENOSPC is already a bug related to our space reservation code, let's just focus on the error handling. For example, we have the following dirty range [0, 64K) of an inode, with 4K sector size and 4K page size: 0 16K 32K 48K 64K |///////////////////////////////////////| |#######################################| Where |///| means page are still dirty, and |###| means the extent io tree has EXTENT_DELALLOC flag. - Enter extent_writepage() for page 0 - Enter btrfs_run_delalloc_range() for range [0, 64K) - Enter cow_file_range() for range [0, 64K) - Function btrfs_reserve_extent() only reserved one 16K extent So we created extent map and ordered extent for range [0, 16K) 0 16K 32K 48K 64K |////////|//////////////////////////////| |<- OE ->|##############################| And range [0, 16K) has its delalloc flag cleared. But since we haven't yet submit any bio, involved 4 pages are still dirty. - Function btrfs_reserve_extent() returns with -ENOSPC Now we have to run error cleanup, which will clear all EXTENT_DELALLOC* flags and clear the dirty flags for the remaining ranges: 0 16K 32K 48K 64K |////////| | | | | Note that range [0, 16K) still has its pages dirty. - Some time later, writeback is triggered again for the range [0, 16K) since the page range still has dirty flags. - btrfs_run_delalloc_range() will do nothing because there is no EXTENT_DELALLOC flag. - extent_writepage_io() finds page 0 has no ordered flag Which falls into the COW fixup path, triggering the BUG_ON(). Unfortunately this error handling bug dates back to the introduction of btrfs. Thankfully with the abuse of COW fixup, at least it won't crash the kernel. [FIX] Instead of immediately unlocking the extent and folios, we keep the extent and folios locked until either erroring out or the whole delalloc range finished. When the whole delalloc range finished without error, we just unlock the whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared. And the involved folios will be properly submitted, with their dirty flags cleared during submission. For the error path, it will be a little more complex: - The range with ordered extent allocated (range (1)) We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining flags are cleaned up by btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered(). For folios we finish the IO (clear dirty, start writeback and immediately finish the writeback) and unlock the folios. - The range with reserved extent but no ordered extent (range(2)) - The range we never touched (range(3)) For both range (2) and range(3) the behavior is not changed. Now even if cow_file_range() failed halfway with some successfully reserved extents/ordered extents, we will keep all folios clean, so there will be no future writeback triggered on them. CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix error handling of submit_uncompressed_range()Qu Wenruo
[BUG] If we failed to compress the range, or cannot reserve a large enough data extent (e.g. too fragmented free space), we will fall back to submit_uncompressed_range(). But inside submit_uncompressed_range(), run_delalloc_cow() can also fail due to -ENOSPC or any other error. In that case there are 3 bugs in the error handling: 1) Double freeing for the same ordered extent This can lead to crash due to ordered extent double accounting 2) Start/end writeback without updating the subpage writeback bitmap 3) Unlock the folio without clear the subpage lock bitmap Both bugs 2) and 3) will crash the kernel if the btrfs block size is smaller than folio size, as the next time the folio gets writeback/lock updates, subpage will find the bitmap already have the range set, triggering an ASSERT(). [CAUSE] Bug 1) happens in the following call chain: submit_uncompressed_range() |- run_delalloc_cow() | |- cow_file_range() | |- btrfs_reserve_extent() | Failed with -ENOSPC or whatever error | |- btrfs_clean_up_ordered_extents() | |- btrfs_mark_ordered_io_finished() | Which cleans all the ordered extents in the async_extent range. | |- btrfs_mark_ordered_io_finished() Which cleans the folio range. The finished ordered extents may not be immediately removed from the ordered io tree, as they are removed inside a work queue. So the second btrfs_mark_ordered_io_finished() may find the finished but not-yet-removed ordered extents, and double free them. Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover other ordered extents. Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(), folio_start_writeback() and folio_end_writeback(), other than the helpers which handle subpage cases. [FIX] For bug 1) since the first btrfs_cleanup_ordered_extents() call is handling the whole range, we should not do the second btrfs_mark_ordered_io_finished() call. And for the first btrfs_cleanup_ordered_extents(), we no longer need to pass the @locked_page parameter, as we are already in the async extent context, thus will never rely on the error handling inside btrfs_run_delalloc_range(). So just let the btrfs_clean_up_ordered_extents() handle every folio equally. For bug 2) we should not even call folio_start_writeback()/folio_end_writeback() anymore. As the error handling protocol, cow_file_range() should clear dirty flag and start/finish the writeback for the whole range passed in. For bug 3) just change the folio_unlock() to btrfs_folio_end_lock() helper. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix double accounting race when extent_writepage_io() failedQu Wenruo
[BUG] If submit_one_sector() failed inside extent_writepage_io() for sector size < page size cases (e.g. 4K sector size and 64K page size), then we can hit double ordered extent accounting error. This should be very rare, as submit_one_sector() only fails when we failed to grab the extent map, and such extent map should exist inside the memory and has been pinned. [CAUSE] For example we have the following folio layout: 0 4K 32K 48K 60K 64K |//| |//////| |///| Where |///| is the dirty range we need to writeback. The 3 different dirty ranges are submitted for regular COW. Now we hit the following sequence: - submit_one_sector() returned 0 for [0, 4K) - submit_one_sector() returned 0 for [32K, 48K) - submit_one_sector() returned error for [60K, 64K) - btrfs_mark_ordered_io_finished() called for the whole folio This will mark the following ranges as finished: * [0, 4K) * [32K, 48K) Both ranges have their IO already submitted, this cleanup will lead to double accounting. * [60K, 64K) That's the correct cleanup. The only good news is, this error is only theoretical, as the target extent map is always pinned, thus we should directly grab it from memory, other than reading it from the disk. [FIX] Instead of calling btrfs_mark_ordered_io_finished() for the whole folio range, which can touch ranges we should not touch, instead move the error handling inside extent_writepage_io(). So that we can cleanup exact sectors that ought to be submitted but failed. This provides much more accurate cleanup, avoiding the double accounting. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix double accounting race when btrfs_run_delalloc_range() failedQu Wenruo
[BUG] When running btrfs with block size (4K) smaller than page size (64K, aarch64), there is a very high chance to crash the kernel at generic/750, with the following messages: (before the call traces, there are 3 extra debug messages added) BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-3): checking UUID tree hrtimer: interrupt took 5451385 ns BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28 BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28 BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs] CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs] lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] Call trace: can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P) can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L) btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs] extent_writepage+0x10c/0x3b8 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x160 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x74/0xa0 start_delalloc_inodes+0x17c/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x17c/0x288 [btrfs] shrink_delalloc+0x11c/0x280 [btrfs] flush_space+0x288/0x328 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x228/0x680 worker_thread+0x1bc/0x360 kthread+0x100/0x118 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0 CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write) pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : process_one_work+0x110/0x680 lr : worker_thread+0x1bc/0x360 Call trace: process_one_work+0x110/0x680 (P) worker_thread+0x1bc/0x360 (L) worker_thread+0x1bc/0x360 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 2-3 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: 0x275bb9540000 from 0xffff800080000000 PHYS_OFFSET: 0xffff8fbba0000000 CPU features: 0x100,00000070,00801250,8201720b [CAUSE] The above warning is triggered immediately after the delalloc range failure, this happens in the following sequence: - Range [1568K, 1636K) is dirty 1536K 1568K 1600K 1636K 1664K | |/////////|////////| | Where 1536K, 1600K and 1664K are page boundaries (64K page size) - Enter extent_writepage() for page 1536K - Enter run_delalloc_nocow() with locked page 1536K and range [1568K, 1636K) This is due to the inode having preallocated extents. - Enter cow_file_range() with locked page 1536K and range [1568K, 1636K) - btrfs_reserve_extent() only reserved two extents The main loop of cow_file_range() only reserved two data extents, Now we have: 1536K 1568K 1600K 1636K 1664K | |<-->|<--->|/|///////| | 1584K 1596K Range [1568K, 1596K) has an ordered extent reserved. - btrfs_reserve_extent() failed inside cow_file_range() for file offset 1596K This is already a bug in our space reservation code, but for now let's focus on the error handling path. Now cow_file_range() returned -ENOSPC. - btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range [1568K, 1636K) Function btrfs_cleanup_ordered_extents() normally needs to skip the ranges inside the folio, as it will normally be cleaned up by extent_writepage(). Such split error handling is already problematic in the first place. What's worse is the folio range skipping itself, which is not taking subpage cases into consideration at all, it will only skip the range if the page start >= the range start. In our case, the page start < the range start, since for subpage cases we can have delalloc ranges inside the folio but not covering the folio. So it doesn't skip the page range at all. This means all the ordered extents, both [1568K, 1584K) and [1584K, 1596K) will be marked as IOERR. And these two ordered extents have no more pending ios, they are marked finished, and *QUEUED* to be deleted from the io tree. - extent_writepage() do error cleanup Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K). Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the deletion from io tree is async, it may or may not happen at this time. If the ranges have not yet been removed, we will do double cleaning on those ranges, triggering the above ordered extent warnings. In theory there are other bugs, like the cleanup in extent_writepage() can cause double accounting on ranges that are submitted asynchronously (compression for example). But that's much harder to trigger because normally we do not mix regular and compression delalloc ranges. [FIX] The folio range split is already buggy and not subpage compatible, it was introduced a long time ago where subpage support was not even considered. So instead of splitting the ordered extents cleanup into the folio range and out of folio range, do all the cleanup inside writepage_delalloc(). - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in btrfs_run_delalloc_range() - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc() failed So all ordered extents are only cleaned up by btrfs_run_delalloc_range(). - Handle the ranges that already have ordered extents allocated If part of the folio already has ordered extent allocated, and btrfs_run_delalloc_range() failed, we also need to cleanup that range. Now we have a concentrated error handling for ordered extents during btrfs_run_delalloc_range(). Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13xfs/libxfs: replace kmalloc() and memcpy() with kmemdup()Mirsad Todorovac
The source static analysis tool gave the following advice: ./fs/xfs/libxfs/xfs_dir2.c:382:15-22: WARNING opportunity for kmemdup → 382 args->value = kmalloc(len, 383 GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_RETRY_MAYFAIL); 384 if (!args->value) 385 return -ENOMEM; 386 → 387 memcpy(args->value, name, len); 388 args->valuelen = len; 389 return -EEXIST; Replacing kmalloc() + memcpy() with kmemdump() doesn't change semantics. Original code works without fault, so this is not a bug fix but proposed improvement. Link: https://lwn.net/Articles/198928/ Fixes: 94a69db2367ef ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS") Fixes: 384f3ced07efd ("[XFS] Return case-insensitive match for dentry cache") Fixes: 2451337dd0439 ("xfs: global error sign conversion") Cc: Carlos Maiolino <cem@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Chandan Babu R <chandanbabu@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: linux-xfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Mirsad Todorovac <mtodorovac69@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-13xfs: constify feature checksChristoph Hellwig
They will eventually be needed to be const for zoned growfs, but even now having such simpler helpers as const as possible is a good thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-13xfs: refactor xfs_fs_statfsChristoph Hellwig
Split out helpers for data, rt data and inode related information, and assigning f_bavail once instead of in three places. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>