summaryrefslogtreecommitdiff
path: root/fs/ext4/inode.c
AgeCommit message (Collapse)Author
2024-12-22ext4: use inode_set_cached_link()Mateusz Guzik
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20241120112037.822078-3-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-18Merge tag 'ext4_for_linus-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most notably in the journaling code, bufered I/O, and compiler warning cleanups" * tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) jbd2: Fix comment describing journal_init_common() ext4: prevent an infinite loop in the lazyinit thread ext4: use struct_size() to improve ext4_htree_store_dirent() ext4: annotate struct fname with __counted_by() jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings ext4: use str_yes_no() helper function ext4: prevent delalloc to nodelalloc on remount jbd2: make b_frozen_data allocation always succeed ext4: cleanup variable name in ext4_fc_del() ext4: use string choices helpers jbd2: remove the 'success' parameter from the jbd2_do_replay() function jbd2: remove useless 'block_error' variable jbd2: factor out jbd2_do_replay() jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass() jbd2: unified release of buffer_head in do_one_pass() jbd2: remove redundant judgments for check v1 checksum ext4: use ERR_CAST to return an error-valued pointer mm: zero range of eof folio exposed by inode size extension ext4: partial zero eof block on unaligned inode size extension ext4: disambiguate the return value of ext4_dio_write_end_io() ...
2024-11-12ext4: partial zero eof block on unaligned inode size extensionBrian Foster
Using mapped writes, it's technically possible to expose stale post-eof data on a truncate up operation. Consider the following example: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "pread -v 2k 16" <file> ... 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX ... This shows that the post-eof data written via mwrite lands within EOF after a truncate up. While this is deliberate of the test case, behavior is somewhat unpredictable because writeback does post-eof zeroing, and writeback can occur at any time in the background. For example, an fsync inserted between the mwrite and truncate causes the subsequent read to instead return zeroes. This basically means that there is a race window in this situation between any subsequent extending operation and writeback that dictates whether post-eof data is exposed to the file or zeroed. To prevent this problem, perform partial block zeroing as part of the various inode size extending operations that are susceptible to it. For truncate extension, zero around the original eof similar to how truncate down does partial zeroing of the new eof. For extension via writes and fallocate related operations, zero the newly exposed range of the file to cover any partial zeroing that must occur at the original and new eof blocks. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: fix race in buffer_head read fault injectionLong Li
When I enabled ext4 debug for fault injection testing, I encountered the following warning: EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress: Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051 WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0 The root cause of the issue lies in the improper implementation of ext4's buffer_head read fault injection. The actual completion of buffer_head read and the buffer_head fault injection are not atomic, which can lead to the uptodate flag being cleared on normally used buffer_heads in race conditions. [CPU0] [CPU1] [CPU2] ext4_read_inode_bitmap ext4_read_bh() <bh read complete> ext4_read_inode_bitmap if (buffer_uptodate(bh)) return bh jbd2_journal_commit_transaction __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer __jbd2_journal_temp_unlink_buffer ext4_simulate_fail_bh() clear_buffer_uptodate mark_buffer_dirty <report warning> WARN_ON_ONCE(!buffer_uptodate(bh)) The best approach would be to perform fault injection in the IO completion callback function, rather than after IO completion. However, the IO completion callback function cannot get the fault injection code in sb. Fix it by passing the result of fault injection into the bh read function, we simulate faults within the bh read function itself. This requires adding an extra parameter to the bh read functions that need fault injection. Fixes: 46f870d690fe ("ext4: simulate various I/O and checksum errors when reading metadata") Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: don't pass full mapping flags to ext4_es_insert_extent()Zhang Yi
When converting a delalloc extent in ext4_es_insert_extent(), since we only want to pass the info of whether the quota has already been claimed if the allocation is a direct allocation from ext4_map_create_blocks(), there is no need to pass full mapping flags, so changes to just pass whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12fs: ext4: Don't use CMA for buffer_headZhaoyang Huang
cma_alloc() keep failed in our system which thanks to a jh->bh->b_page can not be migrated out of CMA area[1] as the jh has one cp_transaction pending on it because of j_free > j_max_transaction_buffers[2][3][4][5][6]. We temporarily solve this by launching jbd2_log_do_checkpoint forcefully somewhere. Since journal is common mechanism to all JFSs and cp_transaction has a little fewer opportunity to be launched, the cma_alloc() could be affected under the same scenario. This patch would like to have buffer_head of ext4 not use CMA pages when doing sb_getblk. [1] crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping) fffffffe01a51c00 e9470000 ffffff808f0aa150 3 2 8000000008020 lru,private fffffffe03d189c0 174627000 ffffff808f0aa150 4 2 2004000000008020 lru,private fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private fffffffe03d88e40 176239000 ffffff808f0aa150 6 2 2008000000008020 lru,private fffffffe03d88e80 17623a000 ffffff808f0aa150 5 2 2008000000008020 lru,private fffffffe03d88ec0 17623b000 ffffff808f0aa150 1 2 2008000000008020 lru,private fffffffe03d88f00 17623c000 ffffff808f0aa150 0 2 2008000000008020 lru,private fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private [2] page -> buffer_head crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x private = 0xffffff802fca0c00 [3] buffer_head -> journal_head crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00 b_private = 0xffffff8041338e10, [4] journal_head -> b_cp_transaction crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x b_cp_transaction = 0xffffff80410f1900, [5] transaction_t -> journal crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x t_journal = 0xffffff80e70f3000, [6] j_free & j_max_transaction_buffers crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x j_free = 0x3f1, j_max_transaction_buffers = 0x100, Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904075300.1148836-1-zhaoyang.huang@unisoc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-05ext4: Do not fallback to buffered-io for DIO atomic writeRitesh Harjani (IBM)
atomic writes is currently only supported for single fsblock and only for direct-io. We should not return -ENOTBLK for atomic writes since we want the atomic write request to either complete fully or fail otherwise. Hence, we should never fallback to buffered-io in case of DIO atomic write requests. Let's also catch if this ever happens by adding some WARN_ON_ONCE before buffered-io handling for direct-io atomic writes. More details of the discussion [1]. While at it let's add an inline helper ext4_want_directio_fallback() which simplifies the logic checks and inherently fixes condition on when to return -ENOTBLK which otherwise was always returning true for any write or directio in ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap. [1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04 Suggested-by: Darrick J. Wong <djwong@kernel.org> # inline helper Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05ext4: Add statx support for atomic writesRitesh Harjani (IBM)
This patch adds base support for atomic writes via statx getattr. On bs < ps systems, we can create FS with say bs of 16k. That means both atomic write min and max unit can be set to 16k for supporting atomic writes. Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-09-20Merge tag 'ext4_for_linus-6.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of cleanups and bug fixes this cycle, primarily in the block allocation, extent management, fast commit, and journalling" * tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits) ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C ext4: check stripe size compatibility on remount as well ext4: fix i_data_sem unlock order in ext4_ind_migrate() ext4: remove the special buffer dirty handling in do_journal_get_write_access ext4: fix a potential assertion failure due to improperly dirtied buffer ext4: hoist ext4_block_write_begin and replace the __block_write_begin ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers ext4: dax: keep orphan list before truncate overflow allocated blocks ext4: fix error message when rejecting the default hash ext4: save unnecessary indentation in ext4_ext_create_new_leaf() ext4: make some fast commit functions reuse extents path ext4: refactor ext4_swap_extents() to reuse extents path ext4: get rid of ppath in convert_initialized_extent() ext4: get rid of ppath in ext4_ext_handle_unwritten_extents() ext4: get rid of ppath in ext4_ext_convert_to_initialized() ext4: get rid of ppath in ext4_convert_unwritten_extents_endio() ext4: get rid of ppath in ext4_split_convert_extents() ext4: get rid of ppath in ext4_split_extent() ext4: get rid of ppath in ext4_force_split_extent_at() ext4: get rid of ppath in ext4_split_extent_at() ...
2024-09-03ext4: remove the special buffer dirty handling in do_journal_get_write_accessShida Zhang
This kinda revert the commit 56d35a4cd13e("ext4: Fix dirtying of journalled buffers in data=journal mode") made by Jan 14 years ago, since the do_get_write_access() itself can deal with the extra unexpected buf dirting things in a proper way now. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-5-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: fix a potential assertion failure due to improperly dirtied bufferShida Zhang
On an old kernel version(4.19, ext3, data=journal, pagesize=64k), an assertion failure will occasionally be triggered by the line below: ----------- jbd2_journal_commit_transaction { ... J_ASSERT_BH(bh, !buffer_dirty(bh)); /* * The buffer on BJ_Forget list and not jbddirty means ... } ----------- The same condition may also be applied to the lattest kernel version. When blocksize < pagesize and we truncate a file, there can be buffers in the mapping tail page beyond i_size. These buffers will be filed to transaction's BJ_Forget list by ext4_journalled_invalidatepage() during truncation. When the transaction doing truncate starts committing, we can grow the file again. This calls __block_write_begin() which allocates new blocks under these buffers in the tail page we go through the branch: if (buffer_new(bh)) { clean_bdev_bh_alias(bh); if (folio_test_uptodate(folio)) { clear_buffer_new(bh); set_buffer_uptodate(bh); mark_buffer_dirty(bh); continue; } ... } Hence buffers on BJ_Forget list of the committing transaction get marked dirty and this triggers the jbd2 assertion. Teach ext4_block_write_begin() to properly handle files with data journalling by avoiding dirtying them directly. Instead of folio_zero_new_buffers() we use ext4_journalled_zero_new_buffers() which takes care of handling journalling. We also don't need to mark new uptodate buffers as dirty in ext4_block_write_begin(). That will be either done either by block_commit_write() in case of success or by folio_zero_new_buffers() in case of failure. Reported-by: Baolin Liu <liubaolin@kylinos.cn> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-4-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: hoist ext4_block_write_begin and replace the __block_write_beginShida Zhang
Using __block_write_begin() make it inconvenient to journal the user data dirty process. We can't tell the block layer maintainer, ‘Hey, we want to trace the dirty user data in ext4, can we add some special code for ext4 in __block_write_begin?’:P So use ext4_block_write_begin() instead. The two functions are basically doing the same thing except for the fscrypt related code. Remove the unnecessary #ifdef since fscrypt_inode_uses_fs_layer_crypto() returns false (and it's known at compile time) when !CONFIG_FS_ENCRYPTION. And hoist the ext4_block_write_begin so that it can be used in other files. Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-3-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffersShida Zhang
For new uptodate buffers we also need to call write_end_fn() to persist the uptodate content, similarly as folio_zero_new_buffers() does it. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-2-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-02ext4: drop ext4_es_is_delonly()Zhang Yi
Since we don't add delayed flag in unwritten extents, so there is no difference between ext4_es_is_delayed() and ext4_es_is_delonly(), just drop ext4_es_is_delonly(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-02ext4: use ext4_map_query_blocks() in ext4_map_blocks()Zhang Yi
The blocks map querying logic in ext4_map_blocks() are the same as ext4_map_query_blocks(), so switch to directly use it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240813123452.2824659-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-02ext4: passing block allocation information to ext4_es_insert_extent()Zhang Yi
Just pass the block allocation flag to ext4_es_insert_extent() when we replacing a current extent after an actually block allocation or extent status conversion, this flag will be used by later changes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-02ext4: don't set EXTENT_STATUS_DELAYED on allocated blocksZhang Yi
Currently, we release delayed allocation reservation when removing delayed extent from extent status tree (which also happens when overwriting one extent with another one). When we allocated unwritten extent under some delayed allocated extent, we don't need the reservation anymore and hence we don't need to preserve the EXT4_MAP_DELAYED status bit. Allocating the new extent blocks will properly release the reservation. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240813123452.2824659-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-02ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag setZhang Yi
When doing block allocation, magic EXT4_GET_BLOCKS_DELALLOC_RESERVE means the allocating range covers a range of delayed allocated clusters, the blocks and quotas have already been reserved in ext4_da_map_blocks(), we should update the reserved space and don't need to claim them again. At the moment, we only set this magic in mpage_map_one_extent() when allocating a range of delayed allocated clusters in the write back path, it makes things complicated since we have to notice and deal with the case of allocating non-delayed allocated clusters separately in ext4_ext_map_blocks(). For example, it we fallocate some blocks that have been delayed allocated, free space would be claimed again in ext4_mb_new_blocks() (this is wrong exactily), and we can't claim quota space again, we have to release the quota reservations made for that previously delayed allocated clusters. Move the position thats set the EXT4_GET_BLOCKS_DELALLOC_RESERVE to where we actually do block allocation, it could simplify above handling a lot, it means that we always set this magic once the allocation range covers delalloc blocks, no need to take care of the allocation path. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240813123452.2824659-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-02ext4: factor out ext4_map_create_blocks() to allocate new blocksZhang Yi
Factor out a common helper ext4_map_create_blocks() from ext4_map_blocks() to do a real blocks allocation, no logic changes. [ Note: this first patch of a ten patch series named "v3: simplify the counting and management of delalloc reserved blocks". The link to the v1 and v2 patch series are below. -- TYT ] Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240802115120.362902-1-yi.zhang@huaweicloud.com # v2 of patch series Link: https://patch.msgid.link/20240601034149.2169771-1-yi.zhang@huaweicloud.com # v1 of the patch series Link: https://patch.msgid.link/20240813123452.2824659-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-26ext4: fix incorrect tid assumption in ext4_wait_for_tail_page_commit()Luis Henriques (SUSE)
Function ext4_wait_for_tail_page_commit() assumes that '0' is not a valid value for transaction IDs, which is incorrect. Don't assume that and invoke jbd2_log_wait_commit() if the journal had a committing transaction instead. Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240724161119.13448-2-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2024-08-07buffer: Convert __block_write_begin() to take a folioMatthew Wilcox (Oracle)
Almost all callers have a folio now, so change __block_write_begin() to take a folio and remove a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_begin to take a folioMatthew Wilcox (Oracle)
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_end to take a folioMatthew Wilcox (Oracle)
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07buffer: Convert block_write_end() to take a folioMatthew Wilcox (Oracle)
All callers now have a folio, so pass it in instead of converting from a folio to a page and back to a folio again. Saves a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-08ext4: sanity check for NULL pointer after ext4_force_shutdownWojciech Gładysz
Test case: 2 threads write short inline data to a file. In ext4_page_mkwrite the resulting inline data is converted. Handling ext4_grp_locked_error with description "block bitmap and bg descriptor inconsistent: X vs Y free clusters" calls ext4_force_shutdown. The conversion clears EXT4_STATE_MAY_INLINE_DATA but fails for ext4_destroy_inline_data_nolock and ext4_mark_iloc_dirty due to ext4_forced_shutdown. The restoration of inline data fails for the same reason not setting EXT4_STATE_MAY_INLINE_DATA. Without the flag set a regular process path in ext4_da_write_end follows trying to dereference page folio private pointer that has not been set. The fix calls early return with -EIO error shall the pointer to private be NULL. Sample crash report: Unable to handle kernel paging request at virtual address dfff800000000004 KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027] Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [dfff800000000004] address between user and kernel address ranges Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 20274 Comm: syz-executor185 Not tainted 6.9.0-rc7-syzkaller-gfda5695d692c #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 lr : __block_commit_write+0x3c/0x2b0 fs/buffer.c:2160 sp : ffff8000a1957600 x29: ffff8000a1957610 x28: dfff800000000000 x27: ffff0000e30e34b0 x26: 0000000000000000 x25: dfff800000000000 x24: dfff800000000000 x23: fffffdffc397c9e0 x22: 0000000000000020 x21: 0000000000000020 x20: 0000000000000040 x19: fffffdffc397c9c0 x18: 1fffe000367bd196 x17: ffff80008eead000 x16: ffff80008ae89e3c x15: 00000000200000c0 x14: 1fffe0001cbe4e04 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000ff0100 x9 : 0000000000000000 x8 : 0000000000000004 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffdffc397c9c0 x4 : 0000000000000020 x3 : 0000000000000020 x2 : 0000000000000040 x1 : 0000000000000020 x0 : fffffdffc397c9c0 Call trace: __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 block_write_end+0xb4/0x104 fs/buffer.c:2253 ext4_da_do_write_end fs/ext4/inode.c:2955 [inline] ext4_da_write_end+0x2c4/0xa40 fs/ext4/inode.c:3028 generic_perform_write+0x394/0x588 mm/filemap.c:3985 ext4_buffered_write_iter+0x2c0/0x4ec fs/ext4/file.c:299 ext4_file_write_iter+0x188/0x1780 call_write_iter include/linux/fs.h:2110 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0x968/0xc3c fs/read_write.c:590 ksys_write+0x15c/0x26c fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __arm64_sys_write+0x7c/0x90 fs/read_write.c:652 __invoke_syscall arch/arm64/kernel/syscall.c:34 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:48 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:133 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:152 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Code: 97f85911 f94002da 91008356 d343fec8 (38796908) ---[ end trace 0000000000000000 ]--- ---------------- Code disassembly (best guess): 0: 97f85911 bl 0xffffffffffe16444 4: f94002da ldr x26, [x22] 8: 91008356 add x22, x26, #0x20 c: d343fec8 lsr x8, x22, #3 * 10: 38796908 ldrb w8, [x8, x25] <-- trapping instruction Reported-by: syzbot+18df508cf00a0598d9a6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18df508cf00a0598d9a6 Link: https://lore.kernel.org/all/000000000000f19a1406109eb5c5@google.com/T/ Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com> Link: https://patch.msgid.link/20240703070112.10235-1-wojciech.gladysz@infogain.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: make ext4_da_map_blocks() buffer_head unawareZhang Yi
After calling the ext4_da_map_blocks(), a delalloc extent state could be identified through the EXT4_MAP_DELAYED flag in map. So factor out buffer_head related handles in ext4_da_map_blocks(), make this function buffer_head unaware and becomes a common helper, and also update the stale function commtents, preparing for the iomap da write path in the future. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: make ext4_insert_delayed_block() insert multi-blocksZhang Yi
Rename ext4_insert_delayed_block() to ext4_insert_delayed_blocks(), pass length parameter to make it insert multiple delalloc blocks at a time. For non-bigalloc case, just reserve len blocks and insert delalloc extent. For bigalloc case, we can ensure that the clusters in the middle of a extent must be unallocated, we only need to check whether the start and end clusters are delayed/allocated. We should subtract the space for the start and/or end block(s) if they are allocated. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: factor out a helper to check the cluster allocation stateZhang Yi
Factor out a common helper ext4_clu_alloc_state(), check whether the cluster containing a delalloc block to be added has been allocated or has delalloc reservation, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: make ext4_da_reserve_space() reserve multi-clustersZhang Yi
Add 'nr_resv' parameter to ext4_da_reserve_space(), which indicates the number of clusters wants to reserve, make it reserve multiple clusters at a time. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: make ext4_es_insert_delayed_block() insert multi-blocksZhang Yi
Rename ext4_es_insert_delayed_block() to ext4_es_insert_delayed_extent() and pass length parameter to make it insert multiple delalloc blocks at a time. For the case of bigalloc, split the allocated parameter to lclu_allocated and end_allocated. lclu_allocated indicates the allocation state of the cluster which is containing the lblk, end_allocated indicates the allocation state of the extent end, clusters in the middle of delay allocated extent must be unallocated. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: drop iblock parameterZhang Yi
The start block of the delalloc extent to be inserted is equal to map->m_lblk, just drop the duplicate iblock input parameter. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/20240517124005.347221-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: trim delalloc extentZhang Yi
In ext4_da_map_blocks(), we could find four kind of extents in the extent status tree: hole, unwritten, written and delayed extent. Now we only trim the map len if we found an unwritten extent or a written extent. This is okay now since map->m_len is always set to one and we always insert one delayed block at a time. But this will become isn't okay for other two cases if ext4_insert_delayed_block() and ext4_da_map_blocks() support inserting multiple map->len blocks later. 1. If we found a hole in the extent status tree which es->es_len is shorter than the length we want to write, we should trim the map->m_len to prevent adding extra delay more blocks than we expected. For example, assume we write data [A, C) to a file that contains a hole extent [A, B) and a written extent [B, D) in the cache. A B C D before da write: ...hhhhhh|wwwwww.... Then we will get extent [A, B), we should trim map->m_len to B-A before inserting new delalloc blocks, if not, the range [B, C) will be duplicated. 2. If we found a delayed extent in the extent status tree which es->es_len is shorter than the length we want to write, we should trim the map->m_len to es->es_len and return directly since the front part of this map has been delayed, we can't insert the delalloc extent that contains the latter part in this round, we should return the delayed length and the caller should increase the position and call ext4_da_map_blocks() again. For example, assume we write data [A, C) to a file that contains a delayed extent [A, B) in the cache. A B C before da write: ...dddddd|hhh.... Then we will get delayed extent [A, B), we should also trim map->m_len to B-A and return, if not, we will incorrectly assume that the write is complete and won't insert [B, C). So we need to always trim the map->m_len if the found es->es_len in the extent status tree is shorter than the map->m_len, prearing for inserting a extent with multiple delalloc blocks. This patch only does a pre-fix, the handle is crude and ext4_da_map_blocks() deserve a cleanup, we will do that later. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: check the extent status again before inserting delalloc blockZhang Yi
ext4_da_map_blocks looks up for any extent entry in the extent status tree (w/o i_data_sem) and then the looks up for any ondisk extent mapping (with i_data_sem in read mode). If it finds a hole in the extent status tree or if it couldn't find any entry at all, it then takes the i_data_sem in write mode to add a da entry into the extent status tree. This can actually race with page mkwrite & fallocate path. Note that this is ok between 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the folio lock 2. ext4 buffered write path v/s ext4 fallocate because of the inode lock. But this can race between ext4_page_mkwrite() & ext4 fallocate path ext4_page_mkwrite() ext4_fallocate() block_page_mkwrite() ext4_da_map_blocks() //find hole in extent status tree ext4_alloc_file_blocks() ext4_map_blocks() //allocate block and unwritten extent ext4_insert_delayed_block() ext4_da_reserve_space() //reserve one more block ext4_es_insert_delayed_block() //drop unwritten extent and add delayed extent by mistake Then, the delalloc extent is wrong until writeback and the extra reserved block can't be released any more and it triggers below warning: EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! Fix the problem by looking up extent status tree again while the i_data_sem is held in write mode. If it still can't find any entry, then we insert a new da entry into the extent status tree. Cc: stable@vger.kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-06-27ext4: factor out a common helper to query extent mapZhang Yi
Factor out a new common helper ext4_map_query_blocks() from the ext4_da_map_blocks(), it query and return the extent map status on the inode's extent path, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/20240517124005.347221-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07ext4: remove the redundant folio_wait_stable()Zhang Yi
__filemap_get_folio() with FGP_WRITEBEGIN parameter has already wait for stable folio, so remove the redundant folio_wait_stable() in ext4_da_write_begin(), it was left over from the commit cc883236b792 ("ext4: drop unnecessary journal handle in delalloc write") that removed the retry getting page logic. Fixes: cc883236b792 ("ext4: drop unnecessary journal handle in delalloc write") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240419023005.2719050-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: set FMODE_CAN_ODIRECT instead of a dummy direct_IO methodChristoph Hellwig
Since commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> [RH: Rebased to upstream] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/e5797bb597219a49043e53e4e90aa494b97dc328.1709215665.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: Remove PAGE_MASK dependency on mpage_submit_folioRitesh Harjani (IBM)
This patch simply removes the PAGE_MASK dependency since mpage_submit_folio() is already converted to work with folio. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d6eadb090334ea49ceef4e643b371fabfcea328f.1709182251.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: Fixes len calculation in mpage_journal_page_buffersRitesh Harjani (IBM)
Truncate operation can race with writeback, in which inode->i_size can get truncated and therefore size - folio_pos() can be negative. This fixes the len calculation. However this path doesn't get easily triggered even with data journaling. Cc: stable@kernel.org # v6.5 Fixes: 80be8c5cc925 ("Fixes: ext4: Make mpage_journal_page_buffers use folio") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/cff4953b5c9306aba71e944ab176a5d396b9a1b7.1709182250.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-21ext4: fix the comment of ext4_map_blocks()/ext4_ext_map_blocks()Cheng Nie
this comment of ext4_map_blocks()/ext4_ext_map_blocks() need update after commit c21770573319("ext4: Define a new set of flags for ext4_get_blocks()"). Signed-off-by: Cheng Nie <niecheng1@uniontech.com> Link: https://lore.kernel.org/r/20240118062511.28276-1-niecheng1@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map typeZhang Yi
Since ext4_map_blocks() can recognize a delayed allocated only extent, make ext4_set_iomap() can also recognize it, and remove the useless separate check in ext4_iomap_begin_report(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01ext4: make ext4_map_blocks() distinguish delalloc only extentZhang Yi
Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a delayed allocated only (not unwritten) one, and making ext4_map_blocks() can distinguish it, no longer mixing it with holes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01ext4: add a hole extent entry in cache after punchZhang Yi
In order to cache hole extents in the extent status tree and keep the hole length as long as possible, re-add a hole entry to the cache just after punching a hole. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01ext4: convert to exclusive lock while inserting delalloc extentsZhang Yi
ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem when inserting delalloc extents, it could be raced by another querying path of ext4_map_blocks() without i_rwsem, .e.g buffered read path. Suppose we buffered read a file containing just a hole, and without any cached extents tree, then it is raced by another delayed buffered write to the same area or the near area belongs to the same hole, and the new delalloc extent could be overwritten to a hole extent. pread() pwrite() filemap_read_folio() ext4_mpage_readpages() ext4_map_blocks() down_read(i_data_sem) ext4_ext_determine_hole() //find hole ext4_ext_put_gap_in_cache() ext4_es_find_extent_range() //no delalloc extent ext4_da_map_blocks() down_read(i_data_sem) ext4_insert_delayed_block() //insert delalloc extent ext4_es_insert_extent() //overwrite delalloc extent to hole This race could lead to inconsistent delalloc extents tree and incorrect reserved space counter. Fix this by converting to hold i_data_sem in exclusive mode when adding a new delalloc extent in ext4_da_map_blocks(). Cc: stable@vger.kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-02-01ext4: refactor ext4_da_map_blocks()Zhang Yi
Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary parameters and branches, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-18ext4: remove unnecessary parameter "needed" in ext4_discard_preallocationsKemeng Shi
The "needed" controls the number of ext4_prealloc_space to discard in ext4_discard_preallocations. Function ext4_discard_preallocations is supposed to discard all non-used preallocated blocks when "needed" is 0 and now ext4_discard_preallocations is always called with "needed" = 0. Remove unnecessary parameter "needed" and remove all non-used preallocated spaces in ext4_discard_preallocations to simplify the code. Note: If count of non-used preallocated spaces could be more than UINT_MAX, there was a memory leak as some non-used preallocated spaces are left ununsed and this commit will fix it. Otherwise, there is no behavior change. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-10Merge tag 'ext4_for_linus-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various ext4 bug fixes and cleanups. The fixes are mostly in the fstrim and mballoc code paths. Also enable dioread_nolock in the case where the block size is less than the page size (dioread_nolock has been default in the bs == ps case for quite some time)" * tag 'ext4_for_linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix inconsistent between segment fstrim and full fstrim ext4: fallback to complex scan if aligned scan doesn't work ext4: convert ext4_da_do_write_end() to take a folio ext4: allow for the last group to be marked as trimmed ext4: move ext4_check_bdev_write_error() into nojournal mode jbd2: abort journal when detecting metadata writeback error of fs dev jbd2: remove unused 'JBD2_CHECKPOINT_IO_ERROR' and 'j_atomic_flags' jbd2: replace journal state flag by checking errseq jbd2: add errseq to detect client fs's bdev writeback error ext4: improving calculation of 'fe_{len|start}' in mb_find_extent() ext4: clarify handling of unwritten bh in __ext4_block_zero_page_range() ext4: treat end of range as exclusive in ext4_zero_range() ext4: enable dioread_nolock as default for bs < ps case ext4: delete redundant calculations in ext4_mb_get_buddy_page_lock() ext4: reduce unnecessary memory allocation in alloc_flex_gd() ext4: avoid online resizing failures due to oversized flex bg ext4: remove unnecessary check from alloc_flex_gd() ext4: unify the type of flexbg_size to unsigned int
2024-01-10ext4: convert ext4_da_do_write_end() to take a folioMatthew Wilcox (Oracle)
There's nothing page-specific happening in ext4_da_do_write_end(); it's merely used for its refcount & lock, both of which are folio properties. Saves four calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231214053035.1018876-1-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-04ext4: clarify handling of unwritten bh in __ext4_block_zero_page_range()Ojaswin Mujoo
As an optimization, I was trying to work on exiting early from this function if dealing with unwritten extent since they anyways read 0. However, it was realised that there are certain code paths that can end up calling ext4_block_zero_page_range() for an unwritten bh that might still have data in pagecache. In this case, we can't exit early and we do require to process the bh and zero out the pagecache to ensure that a writeback can't kick in at a later time and flush the stale pagecache to disk. Since, adding the logic to exit early for unwritten bh was turning out to be much more nuanced and the current code already handles it well, just add a comment to explicitly document this behavior. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d859b7ae5fe42e6626479b91ed9f4da3aae4c597.1698856309.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-12-10fs: convert error_remove_page to error_remove_folioMatthew Wilcox (Oracle)
There were already assertions that we were not passing a tail page to error_remove_page(), so make the compiler enforce that by converting everything to pass and use a folio. Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>