summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2023-08-09fs, block: remove bdev->bd_superChristoph Hellwig
bdev->bd_super is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230807112625.652089-5-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09ext4: don't use bdev->bd_super in __ext4_journal_get_write_accessChristoph Hellwig
__ext4_journal_get_write_access already has a super_block available, and there is no need to go from that to the bdev to go back to the owning super_block. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230807112625.652089-3-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-05ext4: don't use CR_BEST_AVAIL_LEN for non-regular filesRitesh Harjani
Using CR_BEST_AVAIL_LEN only make sense for regular files, as for non-regular files we never normalize the allocation request length i.e. goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len). Hence there is no scope of trimming the goal length to make it satisfy original request len. Thus this patch avoids using CR_BEST_AVAIL_LEN criteria for non-regular files request. Cc: stable@kernel.org Fixes: 33122aa930f1 ("ext4: Add allocation criteria 1.5 (CR1_5)") Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Tested-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/2a694c748ff8b8c4b416995a24f06f07b55047a8.1689516047.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-05ext4: fix memory leaks in ext4_fname_{setup_filename,prepare_lookup}Luís Henriques
If the filename casefolding fails, we'll be leaking memory from the fscrypt_name struct, namely from the 'crypto_buf.name' member. Make sure we free it in the error path on both ext4_fname_setup_filename() and ext4_fname_prepare_lookup() functions. Cc: stable@kernel.org Fixes: 1ae98e295fa2 ("ext4: optimize match for casefolded encrypted dirs") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230803091713.13239-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: correct some stale comment of criteriaKemeng Shi
We named criteria with CR_XXX, correct stale comment to criteria with raw number. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: return found group directly in ext4_mb_choose_next_group_best_availKemeng Shi
Return good group when it's found in loop to remove futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: return found group directly in ext4_mb_choose_next_group_goal_fastKemeng Shi
Return good group when it's found in loop to remove futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: remove unused ext4_{set}/{clear}_bit_atomicKemeng Shi
Remove ext4_set_bit_atomic and ext4_clear_bit_atomic which are defined but not used. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: replace the traditional ternary conditional operator with with max()/min()Kemeng Shi
Replace the traditional ternary conditional operator with with max()/min() Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: remove unnecessary return for void functionKemeng Shi
The return at end of void function is unnecessary, just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: use is_power_of_2 helper in ext4_mb_regular_allocatorKemeng Shi
Use intuitive is_power_of_2 helper in ext4_mb_regular_allocator. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: return found group directly in ext4_mb_choose_next_group_p2_alignedKemeng Shi
Return good group when it's found in loop to remove unnecessary NULL initialization of grp and futher check if good group is found after loop. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: avoid potential data overflow in next_linear_groupKemeng Shi
ngroups is ext4_group_t (unsigned int) while next_linear_group treat it in int. If ngroups is bigger than max number described by int, it will be treat as a negative number. Then "return group + 1 >= ngroups ? 0 : group + 1;" may keep returning 0. Switch int to ext4_group_t in next_linear_group to fix the overflow. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: correct grp validation in ext4_mb_good_groupKemeng Shi
Group corruption check will access memory of grp and will trigger kernel crash if grp is NULL. So do NULL check before corruption check. Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-03ext4: replace CR_FAST macro with inline function for readabilityOjaswin Mujoo
Replace CR_FAST with ext4_mb_cr_expensive() inline function for better readability. This function returns true if the criteria is one of the expensive/slower ones where lots of disk IO/prefetching is acceptable. No functional changes are intended in this patch. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230630085927.140137-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-02fs: add CONFIG_BUFFER_HEADChristoph Hellwig
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes and alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02fs: rename and move block_page_mkwrite_returnChristoph Hellwig
block_page_mkwrite_return is neither block nor mkwrite specific, and should not be under CONFIG_BLOCK. Move it to mm.h and rename it to vmf_fs_error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-29ext4: replace read-only check for shutdown check in mmp codeJan Kara
The multi-mount protection kthread checks for read-only filesystem and aborts in that case. The remount code actually handles stopping of the kthread on remount so the only purpose of the check is in case of emergency remount read-only. Replace the check for read-only filesystem with a check for shutdown filesystem as running MMP on such is risky anyway and it makes ordering of things during remount simpler. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: drop read-only check from ext4_force_commit()Jan Kara
JBD2 code will quickly return without doing anything when there's nothing to commit so there's no point in the read-only check in ext4_force_commit(). Just drop it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-10-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: drop read-only check in ext4_write_inode()Jan Kara
We should not have dirty inodes on read-only filesystem. Also silently bailing without writing anything would be a problem when we enable quotas during remount while the filesystem is read-only. So drop the read-only check. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: drop read-only check in ext4_init_inode_table()Jan Kara
We better should not be initializing inode tables on read-only filesystem. The following transaction start will warn us and make the function bail anyway so drop the pointless check. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: warn on read-only filesystem in ext4_journal_check_start()Jan Kara
Now that filesystem abort marks the filesystem as shutdown, we shouldn't be ever hitting the sb_rdonly() check in ext4_journal_check_start(). Since this is a suitable place for catching all sorts of programming errors, convert the check to WARN_ON instead of dropping it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: avoid starting transaction on read-only fs in ext4_quota_off()Jan Kara
When the filesystem gets first remounted read-only and then unmounted, ext4_quota_off() will try to start a transaction (and fail) on read-only filesystem to cleanup inode flags for legacy quota files. Just bail before trying to start a transaction instead since that is going to issue a warning. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: drop EXT4_MF_FS_ABORTED flagJan Kara
EXT4_MF_FS_ABORTED flag has practically the same intent as EXT4_FLAGS_SHUTDOWN flag. The shutdown flag is checked in many more places than the aborted flag which is mostly the historical artifact where we were relying on SB_RDONLY checks instead of the aborted flag checks. There are only three places - ext4_sync_file(), __ext4_remount(), and mballoc debug code - which check aborted flag and not shutdown flag and this is arguably a bug. Avoid these inconsistencies by removing EXT4_MF_FS_ABORTED flag and using EXT4_FLAGS_SHUTDOWN everywhere. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: make 'abort' mount option handling standardJan Kara
'abort' mount option is the only mount option that has special handling and sets a bit in sbi->s_mount_flags. There is not strong reason for that so just simplify the code and make 'abort' set a bit in sbi->s_mount_opt2 as any other mount option. This simplifies the code and will allow us to drop EXT4_MF_FS_ABORTED completely in the following patch. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: make ext4_forced_shutdown() take struct super_blockJan Kara
Currently ext4_forced_shutdown() takes struct ext4_sb_info but most callers need to get it from struct super_block anyway. So just pass in struct super_block to save all callers from some boilerplate code. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: use sb_rdonly() helper for checking read-only flagJan Kara
sb_rdonly() helper instead of directly checking sb->s_flags. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: remove pointless sb_rdonly() checks from freezing codeJan Kara
ext4_freeze() and ext4_unfreeze() checks for sb_rdonly(). However this check is pointless as VFS already checks for read-only filesystem before calling filesystem specific methods. Remove the pointless checks. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: avoid overlapping preallocations due to overflowBaokun Li
Let's say we want to allocate 2 blocks starting from 4294966386, after predicting the file size, start is aligned to 4294965248, len is changed to 2048, then end = start + size = 0x100000000. Since end is of type ext4_lblk_t, i.e. uint, end is truncated to 0. This causes (pa->pa_lstart >= end) to always hold when checking if the current extent to be allocated crosses already preallocated blocks, so the resulting ac_g_ex may cross already preallocated blocks. Hence we convert the end type to loff_t and use pa_logical_end() to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: fix BUG in ext4_mb_new_inode_pa() due to overflowBaokun Li
When we calculate the end position of ext4_free_extent, this position may be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not the first case of adjusting the best extent, that is, new_bex_end > 0, the following BUG_ON will be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:5116! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279 RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430 Call Trace: <TASK> ext4_mb_use_best_found+0x203/0x2f0 ext4_mb_try_best_found+0x163/0x240 ext4_mb_regular_allocator+0x158/0x1550 ext4_mb_new_blocks+0x86a/0xe10 ext4_ext_map_blocks+0xb0c/0x13a0 ext4_map_blocks+0x2cd/0x8f0 ext4_iomap_begin+0x27b/0x400 iomap_iter+0x222/0x3d0 __iomap_dio_rw+0x243/0xcb0 iomap_dio_rw+0x16/0x80 ========================================================= A simple reproducer demonstrating the problem: mkfs.ext4 -F /dev/sda -b 4096 100M mount /dev/sda /tmp/test fallocate -l1M /tmp/test/tmp fallocate -l10M /tmp/test/file fallocate -i -o 1M -l16777203M /tmp/test/file fsstress -d /tmp/test -l 0 -n 100000 -p 8 & sleep 10 && killall -9 fsstress rm -f /tmp/test/tmp xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192" We simply refactor the logic for adjusting the best extent by adding a temporary ext4_free_extent ex and use extent_logical_end() to avoid overflow, which also simplifies the code. Cc: stable@kernel.org # 6.4 Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29ext4: add two helper functions extent_logical_end() and pa_logical_end()Baokun Li
When we use lstart + len to calculate the end of free extent or prealloc space, it may exceed the maximum value of 4294967295(0xffffffff) supported by ext4_lblk_t and cause overflow, which may lead to various problems. Therefore, we add two helper functions, extent_logical_end() and pa_logical_end(), to limit the type of end to loff_t, and also convert lstart to loff_t for calculation to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-24ext4: convert to ctime accessor functionsJeff Layton
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-40-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-23ext4: fix rbtree traversal bug in ext4_mb_use_preallocatedOjaswin Mujoo
During allocations, while looking for preallocations(PA) in the per inode rbtree, we can't do a direct traversal of the tree because ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted and that can cause direct traversal to skip some entries. This was leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy our request and ultimately tried to create a new PA that would overlap with the missed one. To makes sure we handle that case while still keeping the performance of the rbtree, we make use of the fact that the only pa that could possibly overlap the original goal start is the one that satisfies the below conditions: 1. It must have it's logical start immediately to the left of (ie less than) original logical start. 2. It must not be deleted To find this pa we use the following traversal method: 1. Descend into the rbtree normally to find the immediate neighboring PA. Here we keep descending irrespective of if the PA is deleted or if it overlaps with our request etc. The goal is to find an immediately adjacent PA. 2. If the found PA is on right of original goal, use rb_prev() to find the left adjacent PA. 3. Check if this PA is deleted and keep moving left with rb_prev() until a non deleted PA is found. 4. This is the PA we are looking for. Now we can check if it can satisfy the original request and proceed accordingly. This approach also takes care of having deleted PAs in the tree. (While we are at it, also fix a possible overflow bug in calculating the end of a PA) [1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/ Cc: stable@kernel.org # 6.4 Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()Ojaswin Mujoo
In ext4_mb_choose_next_group_best_avail(), we want the start order to be 1 less than goal length and the min_order to be, at max, 1 more than the original length. This commit fixes an off by one issue that arose due to the fact that 1 << fls(n) > (n). After all the processing: order = 1 order below goal len min_order = maximum of the three:- - order - trim_order - 1 order below B2C(s_stripe) - 1 order above original len Cc: stable@kernel.org Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23ext4: correct inline offset when handling xattrs in inode bodyEric Whitney
When run on a file system where the inline_data feature has been enabled, xfstests generic/269, generic/270, and generic/476 cause ext4 to emit error messages indicating that inline directory entries are corrupted. This occurs because the inline offset used to locate inline directory entries in the inode body is not updated when an xattr in that shared region is deleted and the region is shifted in memory to recover the space it occupied. If the deleted xattr precedes the system.data attribute, which points to the inline directory entries, that attribute will be moved further up in the region. The inline offset continues to point to whatever is located in system.data's former location, with unfortunate effects when used to access directory entries or (presumably) inline data in the inode body. Cc: stable@kernel.org Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-29Merge tag 'fs_for_v6.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc filesystem updates from Jan Kara: - Rewrite kmap_local() handling in ext2 - Convert ext2 direct IO path to iomap (with some infrastructure tweaks associated with that) - Convert two boilerplate licenses in udf to SPDX identifiers - Other small udf, ext2, and quota fixes and cleanups * tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix uninitialized array access for some pathnames ext2: Drop fragment support quota: fix warning in dqgrab() quota: Properly disable quotas when add_dquot_ref() fails fs: udf: udftime: Replace LGPL boilerplate with SPDX identifier fs: udf: Replace GPL 2.0 boilerplate license notice with SPDX identifier fs: Drop wait_unfrozen wait queue ext2_find_entry()/ext2_dotdot(): callers don't need page_addr anymore ext2_{set_link,delete_entry}(): don't bother with page_addr ext2_put_page(): accept any pointer within the page ext2_get_page(): saner type ext2: use offset_in_page() instead of open-coding it as subtraction ext2_rename(): set_link and delete_entry may fail ext2: Add direct-io trace points ext2: Move direct-io to use iomap ext2: Use generic_buffers_fsync() implementation ext4: Use generic_buffers_fsync_noflush() implementation fs/buffer.c: Add generic_buffers_fsync*() implementation ext2/dax: Fix ext2_setsize when len is page aligned
2023-06-29Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various cleanups and bug fixes in ext4's extent status tree, journalling, and block allocator subsystems. Also improve performance for parallel DIO overwrites" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (55 commits) ext4: avoid updating the superblock on a r/o mount if not needed jbd2: skip reading super block if it has been verified ext4: fix to check return value of freeze_bdev() in ext4_shutdown() ext4: refactoring to use the unified helper ext4_quotas_off() ext4: turn quotas off if mount failed after enabling quotas ext4: update doc about journal superblock description ext4: add journal cycled recording support jbd2: continue to record log between each mount jbd2: remove j_format_version jbd2: factor out journal initialization from journal_get_superblock() jbd2: switch to check format version in superblock directly jbd2: remove unused feature macros ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev' ext4: Fix reusing stale buffer heads from last failed mounting ext4: allow concurrent unaligned dio overwrites ext4: clean up mballoc criteria comments ext4: make ext4_zeroout_es() return void ext4: make ext4_es_insert_extent() return void ext4: make ext4_es_insert_delayed_block() return void ext4: make ext4_es_remove_extent() return void ...
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-26ext4: avoid updating the superblock on a r/o mount if not neededTheodore Ts'o
This was noticed by a user who noticied that the mtime of a file backing a loopback device was getting bumped when the loopback device is mounted read/only. Note: This doesn't show up when doing a loopback mount of a file directly, via "mount -o ro /tmp/foo.img /mnt", since the loop device is set read-only when mount automatically creates loop device. However, this is noticeable for a LUKS loop device like this: % cryptsetup luksOpen /tmp/foo.img test % mount -o ro /dev/loop0 /mnt ; umount /mnt or, if LUKS is not in use, if the user manually creates the loop device like this: % losetup /dev/loop0 /tmp/foo.img % mount -o ro /dev/loop0 /mnt ; umount /mnt The modified mtime causes rsync to do a rolling checksum scan of the file on the local and remote side, incrementally increasing the time to rsync the not-modified-but-touched image file. Fixes: eee00237fa5e ("ext4: commit super block if fs record error when journal record without error") Cc: stable@kernel.org Link: https://lore.kernel.org/r/ZIauBR7YiV3rVAHL@glitch Reported-by: Sean Greenslade <sean@seangreenslade.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: fix to check return value of freeze_bdev() in ext4_shutdown()Chao Yu
freeze_bdev() can fail due to a lot of reasons, it needs to check its reason before later process. Fixes: 783d94854499 ("ext4: add EXT4_IOC_GOINGDOWN ioctl") Cc: stable@kernel.org Signed-off-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230606073203.1310389-1-chao@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: refactoring to use the unified helper ext4_quotas_off()Baokun Li
Rename ext4_quota_off_umount() to ext4_quotas_off(), and add type parameter to replace open code in ext4_enable_quotas(). Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230327141630.156875-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: turn quotas off if mount failed after enabling quotasBaokun Li
Yi found during a review of the patch "ext4: don't BUG on inconsistent journal feature" that when ext4_mark_recovery_complete() returns an error value, the error handling path does not turn off the enabled quotas, which triggers the following kmemleak: ================================================================ unreferenced object 0xffff8cf68678e7c0 (size 64): comm "mount", pid 746, jiffies 4294871231 (age 11.540s) hex dump (first 32 bytes): 00 90 ef 82 f6 8c ff ff 00 00 00 00 41 01 00 00 ............A... c7 00 00 00 bd 00 00 00 0a 00 00 00 48 00 00 00 ............H... backtrace: [<00000000c561ef24>] __kmem_cache_alloc_node+0x4d4/0x880 [<00000000d4e621d7>] kmalloc_trace+0x39/0x140 [<00000000837eee74>] v2_read_file_info+0x18a/0x3a0 [<0000000088f6c877>] dquot_load_quota_sb+0x2ed/0x770 [<00000000340a4782>] dquot_load_quota_inode+0xc6/0x1c0 [<0000000089a18bd5>] ext4_enable_quotas+0x17e/0x3a0 [ext4] [<000000003a0268fa>] __ext4_fill_super+0x3448/0x3910 [ext4] [<00000000b0f2a8a8>] ext4_fill_super+0x13d/0x340 [ext4] [<000000004a9489c4>] get_tree_bdev+0x1dc/0x370 [<000000006e723bf1>] ext4_get_tree+0x1d/0x30 [ext4] [<00000000c7cb663d>] vfs_get_tree+0x31/0x160 [<00000000320e1bed>] do_new_mount+0x1d5/0x480 [<00000000c074654c>] path_mount+0x22e/0xbe0 [<0000000003e97a8e>] do_mount+0x95/0xc0 [<000000002f3d3736>] __x64_sys_mount+0xc4/0x160 [<0000000027d2140c>] do_syscall_64+0x3f/0x90 ================================================================ To solve this problem, we add a "failed_mount10" tag, and call ext4_quota_off_umount() in this tag to release the enabled qoutas. Fixes: 11215630aada ("ext4: don't BUG on inconsistent journal feature") Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230327141630.156875-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: add journal cycled recording supportZhang Yi
Always enable 'JBD2_CYCLE_RECORD' journal option on ext4, letting the jbd2 continue to record new journal transactions from the recovered journal head or the checkpointed transactions in the previous mount. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230322013353.1843306-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev'Zhihao Cheng
As discussed in [1], 'sbi->s_journal_bdev != sb->s_bdev' will always become true if sbi->s_journal_bdev exists. Filesystem block device and journal block device are both opened with 'FMODE_EXCL' mode, so these two devices can't be same one. Then we can remove the redundant checking 'sbi->s_journal_bdev != sb->s_bdev' if 'sbi->s_journal_bdev' exists. [1] https://lore.kernel.org/lkml/f86584f6-3877-ff18-47a1-2efaa12d18b2@huawei.com/ Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230315013128.3911115-3-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: Fix reusing stale buffer heads from last failed mountingZhihao Cheng
Following process makes ext4 load stale buffer heads from last failed mounting in a new mounting operation: mount_bdev ext4_fill_super | ext4_load_and_init_journal | ext4_load_journal | jbd2_journal_load | load_superblock | journal_get_superblock | set_buffer_verified(bh) // buffer head is verified | jbd2_journal_recover // failed caused by EIO | goto failed_mount3a // skip 'sb->s_root' initialization deactivate_locked_super kill_block_super generic_shutdown_super if (sb->s_root) // false, skip ext4_put_super->invalidate_bdev-> // invalidate_mapping_pages->mapping_evict_folio-> // filemap_release_folio->try_to_free_buffers, which // cannot drop buffer head. blkdev_put blkdev_put_whole if (atomic_dec_and_test(&bdev->bd_openers)) // false, systemd-udev happens to open the device. Then // blkdev_flush_mapping->kill_bdev->truncate_inode_pages-> // truncate_inode_folio->truncate_cleanup_folio-> // folio_invalidate->block_invalidate_folio-> // filemap_release_folio->try_to_free_buffers will be skipped, // dropping buffer head is missed again. Second mount: ext4_fill_super ext4_load_and_init_journal ext4_load_journal ext4_get_journal jbd2_journal_init_inode journal_init_common bh = getblk_unmovable bh = __find_get_block // Found stale bh in last failed mounting journal->j_sb_buffer = bh jbd2_journal_load load_superblock journal_get_superblock if (buffer_verified(bh)) // true, skip journal->j_format_version = 2, value is 0 jbd2_journal_recover do_one_pass next_log_block += count_tags(journal, bh) // According to journal_tag_bytes(), 'tag_bytes' calculating is // affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3() // returns false because 'j->j_format_version >= 2' is not true, // then we get wrong next_log_block. The do_one_pass may exit // early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'. The filesystem is corrupted here, journal is partially replayed, and new journal sequence number actually is already used by last mounting. The invalidate_bdev() can drop all buffer heads even racing with bare reading block device(eg. systemd-udev), so we can fix it by invalidating bdev in error handling path in __ext4_fill_super(). Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171 Fixes: 25ed6e8a54df ("jbd2: enable journal clients to enable v2 checksumming") Cc: stable@vger.kernel.org # v3.5 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: allow concurrent unaligned dio overwritesBrian Foster
We've had reports of significant performance regression of sub-block (unaligned) direct writes due to the added exclusivity restrictions in ext4. The purpose of the exclusivity requirement for unaligned direct writes is to avoid data corruption caused by unserialized partial block zeroing in the iomap dio layer across overlapping writes. XFS has similar requirements for the same underlying reasons, yet doesn't suffer the extreme performance regression that ext4 does. The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY mode, which allows for optimistic submission of concurrent unaligned I/O and kicks back writes that require partial block zeroing such that they can be submitted in a safe, exclusive context. Since ext4 already performs most of these checks pre-submission, it can support something similar without necessarily relying on the iomap flag and associated retry mechanism. Update the dio write submission path to allow concurrent submission of unaligned direct writes that are purely overwrite and so will not require block zeroing. To improve readability of the various related checks, move the unaligned I/O handling down into ext4_dio_write_checks(), where the dio draining and force wait logic can immediately follow the locking requirement checks. Finally, the IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a precaution should the ext4 overwrite logic ever become inconsistent with the zeroing expectations of iomap dio. The performance improvement of sub-block direct write I/O is shown in the following fio test on a 64xcpu guest vm: Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting --overwrite=1 --thread --size=10G --filename=/mnt/fio --readwrite=write --ramp_time=10s --runtime=60s --numjobs=8 --blocksize=2k --iodepth=256 --allow_file_create=0 v6.2: write: IOPS=4328, BW=8724KiB/s v6.2 (patched): write: IOPS=801k, BW=1565MiB/s Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: clean up mballoc criteria commentsTheodore Ts'o
Line wrap and slightly clarify the comments describing mballoc's cirtiera. Define EXT4_MB_NUM_CRS as part of the enum, so that it will automatically get updated when criteria is added or removed. Also fix a potential unitialized use of 'cr' variable if CONFIG_EXT4_DEBUG is enabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: make ext4_zeroout_es() return voidBaokun Li
After ext4_es_insert_extent() returns void, the return value in ext4_zeroout_es() is also unnecessary, so make it return void too. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-13-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: make ext4_es_insert_extent() return voidBaokun Li
Now ext4_es_insert_extent() never return error, so make it return void. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26ext4: make ext4_es_insert_delayed_block() return voidBaokun Li
Now it never fails when inserting a delay extent, so the return value in ext4_es_insert_delayed_block is no longer necessary, let it return void. [ Fixed bug which caused system hangs during bigalloc test runs. See https://lore.kernel.org/r/20230612030405.GH1436857@mit.edu for more details. -- TYT ] Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>