summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-11-12jbd2: remove useless 'block_error' variableYe Bin
The judgement 'if (block_error && success == 0)' is never valid. Just remove useless 'block_error' variable. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240930005942.626942-6-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12jbd2: factor out jbd2_do_replay()Ye Bin
Factor out jbd2_do_replay() no funtional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240930005942.626942-5-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass()Ye Bin
To make JBD2_COMMIT_BLOCK process more clean, no functional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240930005942.626942-4-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12jbd2: unified release of buffer_head in do_one_pass()Ye Bin
Now buffer_head free is very fragmented in do_one_pass(), unified release of buffer_head in do_one_pass() Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240930005942.626942-3-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12jbd2: remove redundant judgments for check v1 checksumYe Bin
'need_check_commit_time' is only used by v2/v3 checksum, so there isn't need to add 'need_check_commit_time' judegement for v1 checksum logic. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240930005942.626942-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: use ERR_CAST to return an error-valued pointerYu Jiaoliang
Instead of directly casting and returning an error-valued pointer, use ERR_CAST to make the error handling more explicit and improve code clarity. Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com> Link: https://patch.msgid.link/20240920021440.1959243-1-yujiaoliang@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: partial zero eof block on unaligned inode size extensionBrian Foster
Using mapped writes, it's technically possible to expose stale post-eof data on a truncate up operation. Consider the following example: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "pread -v 2k 16" <file> ... 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX ... This shows that the post-eof data written via mwrite lands within EOF after a truncate up. While this is deliberate of the test case, behavior is somewhat unpredictable because writeback does post-eof zeroing, and writeback can occur at any time in the background. For example, an fsync inserted between the mwrite and truncate causes the subsequent read to instead return zeroes. This basically means that there is a race window in this situation between any subsequent extending operation and writeback that dictates whether post-eof data is exposed to the file or zeroed. To prevent this problem, perform partial block zeroing as part of the various inode size extending operations that are susceptible to it. For truncate extension, zero around the original eof similar to how truncate down does partial zeroing of the new eof. For extension via writes and fallocate related operations, zero the newly exposed range of the file to cover any partial zeroing that must occur at the original and new eof blocks. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: disambiguate the return value of ext4_dio_write_end_io()Jinliang Zheng
The commit 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO") causes confusion about the meaning of the return value of ext4_dio_write_end_io(). Specifically, when the ext4_handle_inode_extension() operation succeeds, ext4_dio_write_end_io() directly returns count instead of 0. This does not cause a bug in the current kernel, but the semantics of the return value of the ext4_dio_write_end_io() function are wrong, which is likely to introduce bugs in the future code evolution. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240919082539.381626-1-alexjlzheng@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: pass write-hint for buffered IOj.xia
Commit 449813515d3e ("block, fs: Restore the per-bio/request data lifetime fields") restored write-hint support in ext4. But that is applicable only for direct IO. This patch supports passing write-hint for buffered IO from ext4 file system to block layer by filling bi_write_hint of struct bio in io_submit_add_bh(). Signed-off-by: j.xia <j.xia@samsung.com> Link: https://patch.msgid.link/20240919020341.2657646-1-j.xia@samsung.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: fix race in buffer_head read fault injectionLong Li
When I enabled ext4 debug for fault injection testing, I encountered the following warning: EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress: Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051 WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0 The root cause of the issue lies in the improper implementation of ext4's buffer_head read fault injection. The actual completion of buffer_head read and the buffer_head fault injection are not atomic, which can lead to the uptodate flag being cleared on normally used buffer_heads in race conditions. [CPU0] [CPU1] [CPU2] ext4_read_inode_bitmap ext4_read_bh() <bh read complete> ext4_read_inode_bitmap if (buffer_uptodate(bh)) return bh jbd2_journal_commit_transaction __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer __jbd2_journal_temp_unlink_buffer ext4_simulate_fail_bh() clear_buffer_uptodate mark_buffer_dirty <report warning> WARN_ON_ONCE(!buffer_uptodate(bh)) The best approach would be to perform fault injection in the IO completion callback function, rather than after IO completion. However, the IO completion callback function cannot get the fault injection code in sb. Fix it by passing the result of fault injection into the bh read function, we simulate faults within the bh read function itself. This requires adding an extra parameter to the bh read functions that need fault injection. Fixes: 46f870d690fe ("ext4: simulate various I/O and checksum errors when reading metadata") Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: don't pass full mapping flags to ext4_es_insert_extent()Zhang Yi
When converting a delalloc extent in ext4_es_insert_extent(), since we only want to pass the info of whether the quota has already been claimed if the allocation is a direct allocation from ext4_map_create_blocks(), there is no need to pass full mapping flags, so changes to just pass whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: mark ctx_*_flags() with __maybe_unusedAndy Shevchenko
When ctx_set_flags() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: .../ext4/super.c:2120:1: error: unused function 'ctx_set_flags' [-Werror,-Wunused-function] 2120 | EXT4_SET_CTX(flags); /* set only */ | ^~~~~~~~~~~~~~~~~~~ Fix this by marking ctx_*_flags() with __maybe_unused (mark both for the sake of symmetry). See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20240905163229.140522-1-andriy.shevchenko@linux.intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: return error on syncfs after shutdownAmir Goldstein
This is the logic behavior and one that we would like to verify using a generic fstest similar to xfs/546. Link: https://lore.kernel.org/fstests/20240830152648.GE6216@frogsfrogsfrogs/ Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904084657.1062243-1-amir73il@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12fs: ext4: Don't use CMA for buffer_headZhaoyang Huang
cma_alloc() keep failed in our system which thanks to a jh->bh->b_page can not be migrated out of CMA area[1] as the jh has one cp_transaction pending on it because of j_free > j_max_transaction_buffers[2][3][4][5][6]. We temporarily solve this by launching jbd2_log_do_checkpoint forcefully somewhere. Since journal is common mechanism to all JFSs and cp_transaction has a little fewer opportunity to be launched, the cma_alloc() could be affected under the same scenario. This patch would like to have buffer_head of ext4 not use CMA pages when doing sb_getblk. [1] crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping) fffffffe01a51c00 e9470000 ffffff808f0aa150 3 2 8000000008020 lru,private fffffffe03d189c0 174627000 ffffff808f0aa150 4 2 2004000000008020 lru,private fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private fffffffe03d88e40 176239000 ffffff808f0aa150 6 2 2008000000008020 lru,private fffffffe03d88e80 17623a000 ffffff808f0aa150 5 2 2008000000008020 lru,private fffffffe03d88ec0 17623b000 ffffff808f0aa150 1 2 2008000000008020 lru,private fffffffe03d88f00 17623c000 ffffff808f0aa150 0 2 2008000000008020 lru,private fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private [2] page -> buffer_head crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x private = 0xffffff802fca0c00 [3] buffer_head -> journal_head crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00 b_private = 0xffffff8041338e10, [4] journal_head -> b_cp_transaction crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x b_cp_transaction = 0xffffff80410f1900, [5] transaction_t -> journal crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x t_journal = 0xffffff80e70f3000, [6] j_free & j_max_transaction_buffers crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x j_free = 0x3f1, j_max_transaction_buffers = 0x100, Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904075300.1148836-1-zhaoyang.huang@unisoc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: simplify if conditionJiapeng Chong
The if condition !A || A && B can be simplified to !A || B. ./fs/ext4/fast_commit.c:362:21-23: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9837 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830071713.40565-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: fix FS_IOC_GETFSMAP handlingTheodore Ts'o
The original implementation ext4's FS_IOC_GETFSMAP handling only worked when the range of queried blocks included at least one free (unallocated) block range. This is because how the metadata blocks were emitted was as a side effect of ext4_mballoc_query_range() calling ext4_getfsmap_datadev_helper(), and that function was only called when a free block range was identified. As a result, this caused generic/365 to fail. Fix this by creating a new function ext4_getfsmap_meta_helper() which gets called so that blocks before the first free block range in a block group can get properly reported. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2024-11-12ext4: WARN if a full dir leaf block has only one dentryBaokun Li
The maximum length of a filename is 255 and the minimum block size is 1024, so it is always guaranteed that the number of entries is greater than or equal to 2 when do_split() is called. So unless ext4_dx_add_entry() and make_indexed_dir() or some other functions are buggy, 'split == 0' will not occur. Setting 'continued' to 0 in this case masks the problem that the file system has become corrupted, even though it prevents possible out-of-bounds access. Hence WARN_ON_ONCE() is used to check if 'split' is 0, and if it is then warns and returns an error to abort split. Suggested-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20240823160518.GA424729@mit.edu Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241008121152.3771906-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: show the default enabled prefetch_block_bitmaps optionBaokun Li
After commit 21175ca434c5 ("ext4: make prefetch_block_bitmaps default"), we enable 'prefetch_block_bitmaps' by default, but this is not shown in the '/proc/fs/ext4/sdx/options' procfs interface. This makes it impossible to distinguish whether the feature is enabled by default or not, so 'prefetch_block_bitmaps' is shown in the 'options' procfs interface when prefetch_block_bitmaps is enabled by default. This makes it easy to notice changes to the default mount options between versions through the '/proc/fs/ext4/sdx/options' procfs interface. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241008120134.3758097-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ufs: ufs_sb_private_info: remove unused s_{2,3}apb fieldsAgathe Porte
These two fields are populated and stored as a "frequently used value" in ufs_fill_super, but are not used afterwards in the driver. Moreover, one of the shifts triggers UBSAN: shift-out-of-bounds when apbshift is 12 because 12 * 3 = 36 and 1 << 36 does not fit in the 32 bit integer used to store the value. Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2087853 Signed-off-by: Agathe Porte <agathe.porte@canonical.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-12writeback: wbc_attach_fdatawrite_inode out of lineChristoph Hellwig
This allows exporting this high-level interface only while keeping wbc_attach_and_unlock_inode private in fs-writeback.c and unexporting __inode_attach_wb. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241112054403.1470586-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-12writeback: add a __releases annoation to wbc_attach_and_unlock_inodeChristoph Hellwig
This shuts up a sparse lock context tracking warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241112054403.1470586-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-12fs: add the ability for statmount() to report the fs_subtypeJeff Layton
/proc/self/mountinfo prints out the sb->s_subtype after the type. This is particularly useful for disambiguating FUSE mounts (at least when the userland driver bothers to set it). Add STATMOUNT_FS_SUBTYPE and claim one of the __spare2 fields to point to the offset into the str[] array. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ian Kent <raven@themaw.net> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241111-statmount-v4-2-2eaf35d07a80@kernel.org Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-12fs: don't let statmount return empty stringsJeff Layton
When one of the statmount_string() handlers doesn't emit anything to seq, the kernel currently sets the corresponding flag and emits an empty string. Given that statmount() returns a mask of accessible fields, just leave the bit unset in this case, and skip any NULL termination. If nothing was emitted to the seq, then the EOVERFLOW and EAGAIN cases aren't applicable and the function can just return immediately. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241111-statmount-v4-1-2eaf35d07a80@kernel.org Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-12fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()Mohammed Anees
The comment suggests a hash or array approach to store the active requests. Currently it iterates through all the active requests and when found deletes the requested request, in the linked list. However io_cancel() isn’t a frequently used operation, and optimizing it wouldn’t bring a substantial benefit to real users and the increased complexity of maintaining a hashtable for this would be significant and will slow down other operation. Therefore remove this TODO to avoid people spending time improving this. Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com> Link: https://lore.kernel.org/r/20241112113906.15825-1-pvmohammedanees2003@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-12hfsplus: don't query the device logical block size multiple timesThadeu Lima de Souza Cascardo
Devices block sizes may change. One of these cases is a loop device by using ioctl LOOP_SET_BLOCK_SIZE. While this may cause other issues like IO being rejected, in the case of hfsplus, it will allocate a block by using that size and potentially write out-of-bounds when hfsplus_read_wrapper calls hfsplus_submit_bio and the latter function reads a different io_size. Using a new min_io_size initally set to sb_min_blocksize works for the purposes of the original fix, since it will be set to the max between HFSPLUS_SECTOR_SIZE and the first seen logical block size. We still use the max between HFSPLUS_SECTOR_SIZE and min_io_size in case the latter is not initialized. Tested by mounting an hfsplus filesystem with loop block sizes 512, 1024 and 4096. The produced KASAN report before the fix looks like this: [ 419.944641] ================================================================== [ 419.945655] BUG: KASAN: slab-use-after-free in hfsplus_read_wrapper+0x659/0xa0a [ 419.946703] Read of size 2 at addr ffff88800721fc00 by task repro/10678 [ 419.947612] [ 419.947846] CPU: 0 UID: 0 PID: 10678 Comm: repro Not tainted 6.12.0-rc5-00008-gdf56e0f2f3ca #84 [ 419.949007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 [ 419.950035] Call Trace: [ 419.950384] <TASK> [ 419.950676] dump_stack_lvl+0x57/0x78 [ 419.951212] ? hfsplus_read_wrapper+0x659/0xa0a [ 419.951830] print_report+0x14c/0x49e [ 419.952361] ? __virt_addr_valid+0x267/0x278 [ 419.952979] ? kmem_cache_debug_flags+0xc/0x1d [ 419.953561] ? hfsplus_read_wrapper+0x659/0xa0a [ 419.954231] kasan_report+0x89/0xb0 [ 419.954748] ? hfsplus_read_wrapper+0x659/0xa0a [ 419.955367] hfsplus_read_wrapper+0x659/0xa0a [ 419.955948] ? __pfx_hfsplus_read_wrapper+0x10/0x10 [ 419.956618] ? do_raw_spin_unlock+0x59/0x1a9 [ 419.957214] ? _raw_spin_unlock+0x1a/0x2e [ 419.957772] hfsplus_fill_super+0x348/0x1590 [ 419.958355] ? hlock_class+0x4c/0x109 [ 419.958867] ? __pfx_hfsplus_fill_super+0x10/0x10 [ 419.959499] ? __pfx_string+0x10/0x10 [ 419.960006] ? lock_acquire+0x3e2/0x454 [ 419.960532] ? bdev_name.constprop.0+0xce/0x243 [ 419.961129] ? __pfx_bdev_name.constprop.0+0x10/0x10 [ 419.961799] ? pointer+0x3f0/0x62f [ 419.962277] ? __pfx_pointer+0x10/0x10 [ 419.962761] ? vsnprintf+0x6c4/0xfba [ 419.963178] ? __pfx_vsnprintf+0x10/0x10 [ 419.963621] ? setup_bdev_super+0x376/0x3b3 [ 419.964029] ? snprintf+0x9d/0xd2 [ 419.964344] ? __pfx_snprintf+0x10/0x10 [ 419.964675] ? lock_acquired+0x45c/0x5e9 [ 419.965016] ? set_blocksize+0x139/0x1c1 [ 419.965381] ? sb_set_blocksize+0x6d/0xae [ 419.965742] ? __pfx_hfsplus_fill_super+0x10/0x10 [ 419.966179] mount_bdev+0x12f/0x1bf [ 419.966512] ? __pfx_mount_bdev+0x10/0x10 [ 419.966886] ? vfs_parse_fs_string+0xce/0x111 [ 419.967293] ? __pfx_vfs_parse_fs_string+0x10/0x10 [ 419.967702] ? __pfx_hfsplus_mount+0x10/0x10 [ 419.968073] legacy_get_tree+0x104/0x178 [ 419.968414] vfs_get_tree+0x86/0x296 [ 419.968751] path_mount+0xba3/0xd0b [ 419.969157] ? __pfx_path_mount+0x10/0x10 [ 419.969594] ? kmem_cache_free+0x1e2/0x260 [ 419.970311] do_mount+0x99/0xe0 [ 419.970630] ? __pfx_do_mount+0x10/0x10 [ 419.971008] __do_sys_mount+0x199/0x1c9 [ 419.971397] do_syscall_64+0xd0/0x135 [ 419.971761] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 419.972233] RIP: 0033:0x7c3cb812972e [ 419.972564] Code: 48 8b 0d f5 46 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c2 46 0d 00 f7 d8 64 89 01 48 [ 419.974371] RSP: 002b:00007ffe30632548 EFLAGS: 00000286 ORIG_RAX: 00000000000000a5 [ 419.975048] RAX: ffffffffffffffda RBX: 00007ffe306328d8 RCX: 00007c3cb812972e [ 419.975701] RDX: 0000000020000000 RSI: 0000000020000c80 RDI: 00007ffe306325d0 [ 419.976363] RBP: 00007ffe30632720 R08: 00007ffe30632610 R09: 0000000000000000 [ 419.977034] R10: 0000000000200008 R11: 0000000000000286 R12: 0000000000000000 [ 419.977713] R13: 00007ffe306328e8 R14: 00005a0eb298bc68 R15: 00007c3cb8356000 [ 419.978375] </TASK> [ 419.978589] Fixes: 6596528e391a ("hfsplus: ensure bio requests are not smaller than the hardware sectors") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://lore.kernel.org/r/20241107114109.839253-1-cascardo@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-12bcachefs: Fix assertion pop in bch2_ptr_swab()Kent Overstreet
This runs on extents that haven't yet been validated, so we don't want to assert that we have a valid entry type. Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-12bcachefs: Fix journal_entry_dev_usage_to_text() overrunKent Overstreet
If the jset_entry_dev_usage is malformed, and too small, our nr_entries calculation will be incorrect - just bail out. Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11ocfs2: fix UBSAN warning in ocfs2_verify_volume()Dmitry Antipov
Syzbot has reported the following splat triggered by UBSAN: UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10 shift exponent 32768 is too large for 32-bit type 'int' CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x241/0x360 ? __pfx_dump_stack_lvl+0x10/0x10 ? __pfx__printk+0x10/0x10 ? __asan_memset+0x23/0x50 ? lockdep_init_map_type+0xa1/0x910 __ubsan_handle_shift_out_of_bounds+0x3c8/0x420 ocfs2_fill_super+0xf9c/0x5750 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? validate_chain+0x11e/0x5920 ? __lock_acquire+0x1384/0x2050 ? __pfx_validate_chain+0x10/0x10 ? string+0x26a/0x2b0 ? widen_string+0x3a/0x310 ? string+0x26a/0x2b0 ? bdev_name+0x2b1/0x3c0 ? pointer+0x703/0x1210 ? __pfx_pointer+0x10/0x10 ? __pfx_format_decode+0x10/0x10 ? __lock_acquire+0x1384/0x2050 ? vsnprintf+0x1ccd/0x1da0 ? snprintf+0xda/0x120 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x14f/0x370 ? __pfx_snprintf+0x10/0x10 ? set_blocksize+0x1f9/0x360 ? sb_set_blocksize+0x98/0xf0 ? setup_bdev_super+0x4e6/0x5d0 mount_bdev+0x20c/0x2d0 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_mount_bdev+0x10/0x10 ? vfs_parse_fs_string+0x190/0x230 ? __pfx_vfs_parse_fs_string+0x10/0x10 legacy_get_tree+0xf0/0x190 ? __pfx_ocfs2_mount+0x10/0x10 vfs_get_tree+0x92/0x2b0 do_new_mount+0x2be/0xb40 ? __pfx_do_new_mount+0x10/0x10 __se_sys_mount+0x2d6/0x3c0 ? __pfx___se_sys_mount+0x10/0x10 ? do_syscall_64+0x100/0x230 ? __x64_sys_mount+0x20/0xc0 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f37cae96fda Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240 RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000 R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0 R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000 </TASK> For a really damaged superblock, the value of 'i_super.s_blocksize_bits' may exceed the maximum possible shift for an underlying 'int'. So add an extra check whether the aforementioned field represents the valid block size, which is 512 bytes, 1K, 2K, or 4K. Link: https://lkml.kernel.org/r/20241106092100.2661330-1-dmantipov@yandex.ru Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+56f7cd1abe4b8e475180@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11nilfs2: fix null-ptr-deref in block_dirty_buffer tracepointRyusuke Konishi
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty() may cause a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because, since the tracepoint was added in mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, nilfs_grab_buffer(), which grabs a buffer to read (or create) a block of metadata, including b-tree node blocks, does not set the block device, but instead does so only if the buffer is not in the "uptodate" state for each of its caller block reading functions. However, if the uptodate flag is set on a folio/page, and the buffer heads are detached from it by try_to_free_buffers(), and new buffer heads are then attached by create_empty_buffers(), the uptodate flag may be restored to each buffer without the block device being set to bh->b_bdev, and mark_buffer_dirty() may be called later in that state, resulting in the bug mentioned above. Fix this issue by making nilfs_grab_buffer() always set the block device of the super block structure to the buffer head, regardless of the state of the buffer's uptodate flag. Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ubisectech Sirius <bugreport@valiantsec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11nilfs2: fix null-ptr-deref in block_touch_buffer tracepointRyusuke Konishi
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints". This series fixes null pointer dereference bugs that occur when using nilfs2 and two block-related tracepoints. This patch (of 2): It has been reported that when using "block:block_touch_buffer" tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because since the tracepoint was added in touch_buffer(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, the block_device structure is set after the function returns to the caller. Here, touch_buffer() is used to mark the folio/page that owns the buffer head as accessed, but the common search helper for folio/page used by the caller function was optimized to mark the folio/page as accessed when it was reimplemented a long time ago, eliminating the need to call touch_buffer() here in the first place. So this solves the issue by eliminating the touch_buffer() call itself. Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: Ubisectech Sirius <bugreport@valiantsec.com> Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2 Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11netfs/fscache: Add a memory barrier for FSCACHE_VOLUME_CREATINGZizhi Wo
In fscache_create_volume(), there is a missing memory barrier between the bit-clearing operation and the wake-up operation. This may cause a situation where, after a wake-up, the bit-clearing operation hasn't been detected yet, leading to an indefinite wait. The triggering process is as follows: [cookie1] [cookie2] [volume_work] fscache_perform_lookup fscache_create_volume fscache_perform_lookup fscache_create_volume fscache_create_volume_work cachefiles_acquire_volume clear_and_wake_up_bit test_and_set_bit test_and_set_bit goto maybe_wait goto no_wait In the above process, cookie1 and cookie2 has the same volume. When cookie1 enters the -no_wait- process, it will clear the bit and wake up the waiting process. If a barrier is missing, it may cause cookie2 to remain in the -wait- process indefinitely. In commit 3288666c7256 ("fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work()"), barriers were added to similar operations in fscache_create_volume_work(), but fscache_create_volume() was missed. By combining the clear and wake operations into clear_and_wake_up_bit() to fix this issue. Fixes: bfa22da3ed65 ("fscache: Provide and use cache methods to lookup/create/free a volume") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://lore.kernel.org/r/20241107110649.3980193-6-wozizhi@huawei.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11cachefiles: Fix NULL pointer dereference in object->fileZizhi Wo
At present, the object->file has the NULL pointer dereference problem in ondemand-mode. The root cause is that the allocated fd and object->file lifetime are inconsistent, and the user-space invocation to anon_fd uses object->file. Following is the process that triggers the issue: [write fd] [umount] cachefiles_ondemand_fd_write_iter fscache_cookie_state_machine cachefiles_withdraw_cookie if (!file) return -ENOBUFS cachefiles_clean_up_object cachefiles_unmark_inode_in_use fput(object->file) object->file = NULL // file NULL pointer dereference! __cachefiles_write(..., file, ...) Fix this issue by add an additional reference count to the object->file before write/llseek, and decrement after it finished. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://lore.kernel.org/r/20241107110649.3980193-5-wozizhi@huawei.com Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11cachefiles: Clean up in cachefiles_commit_tmpfile()Zizhi Wo
Currently, cachefiles_commit_tmpfile() will only be called if object->flags is set to CACHEFILES_OBJECT_USING_TMPFILE. Only cachefiles_create_file() and cachefiles_invalidate_cookie() set this flag. Both of these functions replace object->file with the new tmpfile, and both are called by fscache_cookie_state_machine(), so there are no concurrency issues. So the equation "d_backing_inode(dentry) == file_inode(object->file)" in cachefiles_commit_tmpfile() will never hold true according to the above conditions. This patch removes this part of the redundant code and does not involve any other logical changes. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://lore.kernel.org/r/20241107110649.3980193-4-wozizhi@huawei.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11cachefiles: Fix missing pos updates in cachefiles_ondemand_fd_write_iter()Zizhi Wo
In the erofs on-demand loading scenario, read and write operations are usually delivered through "off" and "len" contained in read req in user mode. Naturally, pwrite is used to specify a specific offset to complete write operations. However, if the write(not pwrite) syscall is called multiple times in the read-ahead scenario, we need to manually update ki_pos after each write operation to update file->f_pos. This step is currently missing from the cachefiles_ondemand_fd_write_iter function, added to address this issue. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://lore.kernel.org/r/20241107110649.3980193-3-wozizhi@huawei.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11cachefiles: Fix incorrect length return value in ↵Zizhi Wo
cachefiles_ondemand_fd_write_iter() cachefiles_ondemand_fd_write_iter() function first aligns "pos" and "len" to block boundaries. When calling __cachefiles_write(), the aligned "pos" is passed in, but "len" is the original unaligned value(iter->count). Additionally, the returned length of the write operation is the modified "len" aligned by block size, which is unreasonable. The alignment of "pos" and "len" is intended only to check whether the cache has enough space. But the modified len should not be used as the return value of cachefiles_ondemand_fd_write_iter() because the length we passed to __cachefiles_write() is the previous "len". Doing so would result in a mismatch in the data written on-demand. For example, if the length of the user state passed in is not aligned to the block size (the preread scene/DIO writes only need 512 alignment/Fault injection), the length of the write will differ from the actual length of the return. To solve this issue, since the __cachefiles_prepare_write() modifies the size of "len", we pass "aligned_len" to __cachefiles_prepare_write() to calculate the free blocks and use the original "len" as the return value of cachefiles_ondemand_fd_write_iter(). Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://lore.kernel.org/r/20241107110649.3980193-2-wozizhi@huawei.com Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11iomap: drop an obsolete comment in iomap_dio_bio_iterChristoph Hellwig
No more zone append special casing in iomap for quite a while. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241111121340.1390540-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11btrfs: send: check for read-only send root under critical sectionFilipe Manana
We're checking if the send root is read-only without being under the protection of the root's root_item_lock spinlock, which is what protects the root's flags when clearing the read-only flag, done at btrfs_ioctl_subvol_setflags(). Furthermore, it should be done in the same critical section that increments the root's send_in_progress counter, as btrfs_ioctl_subvol_setflags() clears the read-only flag in the same critical section that checks the counter's value. So fix this by moving the read-only check under the critical section delimited by the root's root_item_lock which also increments the root's send_in_progress counter. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: send: check for dead send root under critical sectionFilipe Manana
We're checking if the send root is dead without the protection of the root's root_item_lock spinlock, which is what protects the root's flags. The inverse, setting the dead flag on a root, is done under the protection of that lock, at btrfs_delete_subvolume(). Also checking and updating the root's send_in_progress counter is supposed to be done in the same critical section as checking for or setting the root dead flag, so that these operations are done atomically as a single step (which is correctly done by btrfs_delete_subvolume()). So fix this by checking if the send root is dead in the same critical section that updates the send_in_progress counter, which is protected by the root's root_item_lock spinlock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()Filipe Manana
Smatch complains about possibly dereferencing a NULL fs_info at btrfs_folio_end_lock_bitmap(): fs/btrfs/subpage.c:332 btrfs_folio_end_lock_bitmap() warn: variable dereferenced before check 'fs_info' (see line 326) because we access fs_info to set the 'start_bit' variable before doing the check for a NULL fs_info. However fs_info is never NULL, since in the only caller of btrfs_folio_end_lock_bitmap() is extent_writepage(), where we have an inode which always as a non-NULL fs_info. So remove the check for a NULL fs_info at btrfs_folio_end_lock_bitmap(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()Filipe Manana
Smatch complains about calling PTR_ERR() against a NULL pointer: fs/btrfs/super.c:2272 btrfs_control_ioctl() warn: passing zero to 'PTR_ERR' Fix this by calling PTR_ERR() against the device pointer only if it contains an error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: fix a typo in btrfs_use_zone_appendChristoph Hellwig
REQ_OP_ZONE_APPNED -> REQ_OP_ZONE_APPEND. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()Mark Harmstone
Change the control flow of btrfs_encoded_read() so that it doesn't call free_extent_map() when we know that this has already been done. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Suggested-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()Filipe Manana
There's no point in having a 'snapshot_force_cow' variable to track if we need to decrement the root->snapshot_force_cow counter, as we never jump to the 'out' label after incrementing the counter. Simplify this by removing the variable and always decrementing the counter before the 'out' label, right after the call to btrfs_mksubvol(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove hole from struct btrfs_delayed_nodeFilipe Manana
On x86_64 and a release kernel, there's a 4 bytes hole in the structure after the ref count field: struct btrfs_delayed_node { u64 inode_id; /* 0 8 */ u64 bytes_reserved; /* 8 8 */ struct btrfs_root * root; /* 16 8 */ struct list_head n_list; /* 24 16 */ struct list_head p_list; /* 40 16 */ struct rb_root_cached ins_root; /* 56 16 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ struct rb_root_cached del_root; /* 72 16 */ struct mutex mutex; /* 88 32 */ struct btrfs_inode_item inode_item; /* 120 160 */ /* --- cacheline 4 boundary (256 bytes) was 24 bytes ago --- */ refcount_t refs; /* 280 4 */ /* XXX 4 bytes hole, try to pack */ u64 index_cnt; /* 288 8 */ long unsigned int flags; /* 296 8 */ int count; /* 304 4 */ u32 curr_index_batch_size; /* 308 4 */ u32 index_item_leaves; /* 312 4 */ /* size: 320, cachelines: 5, members: 15 */ /* sum members: 312, holes: 1, sum holes: 4 */ /* padding: 4 */ }; Move the 'count' field, which is 4 bytes long, to just below the ref count field, so we eliminate the hole and reduce the structure size from 320 bytes down to 312 bytes: struct btrfs_delayed_node { u64 inode_id; /* 0 8 */ u64 bytes_reserved; /* 8 8 */ struct btrfs_root * root; /* 16 8 */ struct list_head n_list; /* 24 16 */ struct list_head p_list; /* 40 16 */ struct rb_root_cached ins_root; /* 56 16 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ struct rb_root_cached del_root; /* 72 16 */ struct mutex mutex; /* 88 32 */ struct btrfs_inode_item inode_item; /* 120 160 */ /* --- cacheline 4 boundary (256 bytes) was 24 bytes ago --- */ refcount_t refs; /* 280 4 */ int count; /* 284 4 */ u64 index_cnt; /* 288 8 */ long unsigned int flags; /* 296 8 */ u32 curr_index_batch_size; /* 304 4 */ u32 index_item_leaves; /* 308 4 */ /* size: 312, cachelines: 5, members: 15 */ /* last cacheline: 56 bytes */ }; This now allows to have 13 delayed nodes per 4K page instead of 12. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: update stale comment for struct btrfs_delayed_ref_node::add_listFilipe Manana
The comment refers to a list in the respective delayed ref head that no longer exists (ref_list), it was replaced with a rbtree (ref_tree) in commit 0e0adbcfdc90 ("btrfs: track refs in a rb_tree instead of a list"). So update the stale comment to refer to the rbtree instead of the old list. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: add new ioctl to wait for cleaned subvolumesDavid Sterba
Add a new unprivileged ioctl that will let the command 'btrfs subvolume sync' work without the (privileged) SEARCH_TREE ioctl. There are several modes of operation, where the most common ones are to wait on a specific subvolume or all currently queued for cleaning. This is utilized e.g. in backup applications that delete subvolumes and wait until they're cleaned to check for remaining space. The other modes are for flexibility, e.g. for monitoring or checkpoints in the queue of deleted subvolumes, again without the need to use SEARCH_TREE. Notes: - waiting is interruptible, the timeout is set to 1 second and is not configurable - repeated calls to the ioctl see a different state, so this is inherently racy when using e.g. the count or peek next/last Use cases: - a subvolume A was deleted, wait for cleaning (WAIT_FOR_ONE) - a bunch of subvolumes were deleted, wait for all (WAIT_FOR_QUEUED or PEEK_LAST + WAIT_FOR_ONE) - count how many are queued (not blocking), for monitoring purposes - report progress (PEEK_NEXT), may miss some if cleaning is quick - own waiting in user space (PEEK_LAST until it's 0) Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: simplify range tracking in cow_file_range()Haisu Wang
Simplify tracking of the range processed by using cur_alloc_size only to store the reserved part that may fail to the allocated extent. Remove the ram_size as well since it is always equal to cur_alloc_size in the context. Advance the start in normal path until extent allocation succeeds and keep the start unchanged in the error handling path. Passed the fstest generic/475 test for a hundred times with quota enabled. And a modified generic/475 test by removing the sleep time for a hundred times. About one tenth of the tests do enter the error handling path due to fail to reserve extent. Suggested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Haisu Wang <haisuwang@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove conditional path allocation in btrfs_read_locked_inode()Leo Martins
Remove conditional path allocation from btrfs_read_locked_inode(). Add an ASSERT(path) to indicate it should never be called with a NULL path. Call btrfs_read_locked_inode() directly from btrfs_iget(). This causes code duplication between btrfs_iget() and btrfs_iget_path(), but I think this is justifiable as it removes the need for conditionally allocating the path inside of btrfs_read_locked_inode(). This makes the code easier to reason about and makes it clear who has the responsibility of allocating and freeing the path. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: push cleanup into btrfs_read_locked_inode()Leo Martins
Move btrfs_add_inode_to_root() so it can be called from btrfs_read_locked_inode(), no changes were made to the function. Move cleanup code from btrfs_iget_path() to btrfs_read_locked_inode. This improves readability and improves a leaky abstraction. Previously btrfs_iget_path() had to handle a positive error case as a result of a call to btrfs_search_slot(), but it makes more sense to handle this closer to the source of the call. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()Mark Harmstone
Add struct io_btrfs_cmd as a wrapper type for io_uring_cmd_to_pdu(), rather than using a raw pointer. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>