path: root/fs/ext4/inode.c
Age | Commit message | Author
2023-05-30 | ext4: disallow ea_inodes with extended attributes | Theodore Ts'o
An ea_inode stores the value of an extended attribute; it can not have extended attributes itself, or this will cause recursive nightmares. Add a check in ext4_iget() to make sure this is the case. Cc: stable@kernel.org Reported-by: syzbot+e44749b6ba4d0434cd47@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230524034951.779531-4-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-28 | ext4: add EA_INODE checking to ext4_iget() | Theodore Ts'o
Add a new flag, EXT4_IGET_EA_INODE, which indicates whether the inode is expected to have the EA_INODE flag or not. If the flag is not set/clear as expected, then fail the iget() operation and mark the file system as corrupted. This commit also makes ext4_iget() always perform the is_bad_inode() check even when the inode is already in the inode cache. This allows us to remove the is_bad_inode() check from the callers of ext4_iget() in the ea_inode code. Reported-by: syzbot+cbb68193bdb95af4340a@syzkaller.appspotmail.com Reported-by: syzbot+62120febbd1ee3c3c860@syzkaller.appspotmail.com Reported-by: syzbot+edce54daffee36421b4c@syzkaller.appspotmail.com Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230524034951.779531-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
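The flag check described in this commit, together with the rule from the 2023-05-30 commit that an ea_inode may not carry extended attributes itself, can be sketched as a small userspace model. This is only an illustration: `model_inode`, `MODEL_EA_INODE_FL`, `MODEL_IGET_EA_INODE`, `model_check_ea_inode()` and the returned strings are hypothetical names chosen here, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the values are illustrative, not the kernel's. */
#define MODEL_EA_INODE_FL   0x1u  /* inode stores the value of an xattr */
#define MODEL_IGET_EA_INODE 0x1u  /* caller expects to get an EA inode */

struct model_inode {
    unsigned int flags;
    bool has_xattrs;  /* true if the inode carries xattrs of its own */
};

/*
 * Returns NULL when the inode is consistent with what the caller asked
 * for, otherwise a short description of the inconsistency.
 */
static const char *model_check_ea_inode(const struct model_inode *inode,
                                        unsigned int iget_flags)
{
    bool is_ea = (inode->flags & MODEL_EA_INODE_FL) != 0;
    bool want_ea = (iget_flags & MODEL_IGET_EA_INODE) != 0;

    if (is_ea && !want_ea)
        return "unexpected EA_INODE flag";
    if (!is_ea && want_ea)
        return "missing EA_INODE flag";
    /* An ea_inode with its own xattrs would allow recursion. */
    if (is_ea && inode->has_xattrs)
        return "ea_inode with extended attributes";
    return NULL;
}
```

In this model a non-NULL result corresponds to failing the iget() operation and treating the file system as corrupted.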
2023-05-13 | ext4: check iomap type only if ext4_iomap_begin() does not fail | Baokun Li
When ext4_iomap_overwrite_begin() calls ext4_iomap_begin(), mapping blocks may fail for some reason (e.g. memory allocation failure, bare disk write), and the subsequent "iomap->type != IOMAP_MAPPED" check then triggers WARN_ON(). When ext4_iomap_begin() returns an error, it is normal that iomap->type may not match the expectation. Therefore, we only check whether iomap->type is as expected when ext4_iomap_begin() has executed successfully. Cc: stable@kernel.org Reported-by: syzbot+08106c4b7d60702dbc14@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/00000000000015760b05f9b4eee9@google.com Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230505132429.714648-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
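The pattern this commit describes, validating an output field only when the call that fills it has succeeded, might look roughly like the following userspace sketch. All `model_` names are hypothetical, and a counter stands in for WARN_ON().

```c
#include <errno.h>
#include <stdbool.h>

enum model_iomap_type { MODEL_IOMAP_HOLE, MODEL_IOMAP_MAPPED };

struct model_iomap {
    enum model_iomap_type type;
};

static int model_warnings;  /* counts how often the warning would fire */

/* Hypothetical begin step: on failure it leaves map->type untouched. */
static int model_iomap_begin(struct model_iomap *map, bool fail)
{
    if (fail)
        return -ENOMEM;
    map->type = MODEL_IOMAP_MAPPED;
    return 0;
}

static int model_iomap_overwrite_begin(struct model_iomap *map, bool fail)
{
    int ret = model_iomap_begin(map, fail);

    /* Only inspect the type when the begin step succeeded. */
    if (!ret && map->type != MODEL_IOMAP_MAPPED)
        model_warnings++;
    return ret;
}
```

Without the `!ret` guard, a failed begin step that left `type` at its initial value would be reported as a mapping inconsistency even though the error return already explains it.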
2023-05-13 | ext4: avoid deadlock in fs reclaim with page writeback | Jan Kara
Ext4 has a filesystem wide lock protecting ext4_writepages() calls to avoid races with switching of journalled data flag or inode format. This lock can however cause a deadlock like:

CPU0                            CPU1

ext4_writepages()
  percpu_down_read(sbi->s_writepages_rwsem);
                                ext4_change_inode_journal_flag()
                                  percpu_down_write(sbi->s_writepages_rwsem);
                                    - blocks, all readers block from now on
  ext4_do_writepages()
    ext4_init_io_end()
      kmem_cache_zalloc(io_end_cachep, GFP_KERNEL)
        fs_reclaim frees dentry...
          dentry_unlink_inode()
            iput() - last ref =>
              iput_final() - inode dirty =>
                write_inode_now()...
                  ext4_writepages() tries to acquire sbi->s_writepages_rwsem
                    and blocks forever

Make sure we cannot recurse into filesystem reclaim from writeback code to avoid the deadlock.

Reported-by: syzbot+6898da502aef574c5f8a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000004c66b405fa108e27@google.com
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230504124723.20205-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
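The idea of forbidding recursion into filesystem reclaim for the duration of writeback can be modelled in userspace as a scoped flag that allocations consult. The commit message does not say which mechanism the patch uses; the kernel does provide memalloc_nofs_save() and memalloc_nofs_restore() for scopes of this kind, and the sketch below only imitates their save/restore shape. Every `model_` name is hypothetical.

```c
#include <stdbool.h>

/* Per-task state in this model. */
static bool model_nofs_scope;       /* true while FS reclaim is forbidden */
static int  model_reclaim_entries;  /* times reclaim re-entered the filesystem */

/* Enter the scope and return the previous state so scopes can nest. */
static bool model_nofs_save(void)
{
    bool old = model_nofs_scope;

    model_nofs_scope = true;
    return old;
}

static void model_nofs_restore(bool old)
{
    model_nofs_scope = old;
}

/* Hypothetical allocation: it may recurse into the FS unless the scope forbids it. */
static void model_alloc(void)
{
    if (!model_nofs_scope)
        model_reclaim_entries++;  /* would end up in writeback again */
}

static void model_writepages(bool protect)
{
    bool old = false;

    /* The real code holds the writepages lock for reading at this point. */
    if (protect)
        old = model_nofs_save();
    model_alloc();
    if (protect)
        model_nofs_restore(old);
}
```

With `protect` set, the allocation made while the lock is held never re-enters the filesystem, so the last step of the deadlock chain above cannot occur in the model.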
2023-05-01 | Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds
Pull ext4 fixes from Ted Ts'o: "Some ext4 regression and bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: clean up error handling in __ext4_fill_super()
  ext4: reflect error codes from ext4_multi_mount_protect() to its callers
  ext4: fix lost error code reporting in __ext4_fill_super()
  ext4: fix unused iterator variable warnings
  ext4: fix use-after-free read in ext4_find_extent for bigalloc + inline
  ext4: fix i_disksize exceeding i_size problem in paritally written case
2023-04-28 | ext4: fix i_disksize exceeding i_size problem in paritally written case | Zhihao Cheng
It is possible for i_disksize to exceed i_size, triggering a warning.

generic_perform_write
 copied = iov_iter_copy_from_user_atomic(len) // copied < len
 ext4_da_write_end
 | ext4_update_i_disksize
 |  new_i_size = pos + copied;
 |  WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize) // update i_disksize
 | generic_write_end
 |  copied = block_write_end(copied, len) // copied = 0
 |   if (unlikely(copied < len))
 |    if (!PageUptodate(page))
 |     copied = 0;
 |  if (pos + copied > inode->i_size) // return false
 if (unlikely(copied == 0))
  goto again;
 if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
  status = -EFAULT;
  break;
 }

We get i_disksize greater than i_size here, which could trigger WARNING check 'i_size_read(inode) < EXT4_I(inode)->i_disksize' while doing dio:

ext4_dio_write_iter
 iomap_dio_rw
  __iomap_dio_rw // return err, length is not aligned to 512
 ext4_handle_inode_extension
  WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize) // Oops

 WARNING: CPU: 2 PID: 2609 at fs/ext4/file.c:319
 CPU: 2 PID: 2609 Comm: aa Not tainted 6.3.0-rc2
 RIP: 0010:ext4_file_write_iter+0xbc7
 Call Trace:
  vfs_write+0x3b1
  ksys_write+0x77
  do_syscall_64+0x39

Fix it by updating 'copied' value before updating i_disksize just like ext4_write_inline_data_end() does.

A reproducer can be found in the bugzilla link below.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217209
Fixes: 64769240bd07 ("ext4: Add delayed allocation support in data=writeback mode")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230321013721.89818-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
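The ordering problem in the trace above can be reproduced with a simplified userspace model. This is a sketch under assumptions, not the kernel code: `model_da_inode` and both functions are hypothetical, and the `uptodate` parameter stands in for the PageUptodate() test.

```c
#include <stdbool.h>

struct model_da_inode {
    long long i_size;
    long long i_disksize;
};

/* Buggy order: i_disksize advances using the value of 'copied' before it is adjusted. */
static long long model_write_end_buggy(struct model_da_inode *inode,
                                       long long pos, long long len,
                                       long long copied, bool uptodate)
{
    if (pos + copied > inode->i_disksize)
        inode->i_disksize = pos + copied;
    /* A short copy into a page that is not up to date is discarded. */
    if (copied < len && !uptodate)
        copied = 0;
    if (pos + copied > inode->i_size)
        inode->i_size = pos + copied;
    return copied;
}

/* Fixed order: settle 'copied' first, then update both sizes from it. */
static long long model_write_end_fixed(struct model_da_inode *inode,
                                       long long pos, long long len,
                                       long long copied, bool uptodate)
{
    if (copied < len && !uptodate)
        copied = 0;
    if (pos + copied > inode->i_size)
        inode->i_size = pos + copied;
    if (pos + copied > inode->i_disksize)
        inode->i_disksize = pos + copied;
    return copied;
}
```

For a short copy of 100 bytes out of 4096 at offset 0 into a page that is not up to date, the buggy variant leaves `i_disksize` at 100 while `i_size` stays 0; the fixed variant keeps `i_disksize <= i_size`.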
2023-04-27 | Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds
Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of switching from a user process to a kernel thread.
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 - zsmalloc performance improvements from Sergey Senozhatsky.
 - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables.
 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful
 - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'.
 - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits.
 - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n).
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd, permitting userspace to wr-protect anon memory unpopulated ptes.
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning.
 - Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte.
 - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness.
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c.
 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more.
 - Lorenzo has also converted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases.
 - Lorenzo has also contributed further cleanups of vma_merge().
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults.
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking.
 - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads.
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic.
 - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used.
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem.
 - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths.
 - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing.
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 - Peter Xu contributed some rationalizations of the userfaultfd selftests.
 - Yosry Ahmed has fixed an issue around memcg's page reclaim accounting.
 - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code.
 - Longlong Xia has improved the way in which KSM handles hwpoisoned pages.
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-14 | Revert "ext4: Fix warnings when freezing filesystem with journaled data" | Jan Kara
After making ext4_writepages() properly clean all pages there is no need for special treatment of filesystem freezing. Revert commit e6c28a26b799c7640b77daff3e4a67808c74381c. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-13-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14 | ext4: Update comment in mpage_prepare_extent_to_map() | Jan Kara
Since filemap_write_and_wait() is now enough to get journalled data to final location update the comment in mpage_prepare_extent_to_map(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-12-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14 | ext4: Simplify handling of journalled data in ext4_bmap() | Jan Kara
Now that ext4_writepages() gets journalled data into its final location we just use filemap_write_and_wait() instead of special handling of journalled data in ext4_bmap(). We can also drop EXT4_STATE_JDATA flag as it is not used anymore. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14 | ext4: Drop special handling of journalled data from ext4_evict_inode() | Jan Kara
Now that ext4_writepages() makes sure journalled data is on stable storage, write_inode_now() call in iput_final() is enough to make pagecache pages with journalled data really clean (data committed and checkpointed). So we can drop special handling of journalled data in ext4_evict_inode(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14 | ext4: Commit transaction before writing back pages in data=journal mode | Jan Kara
When journalling data we currently just walk over pages, journal those that are marked for delayed dirtying (only pinned pages dirtied behind our back these days) and checkpoint other dirty pages. Because some pages may be part of running transaction the result is that after filemap_write_and_wait() we are not guaranteed pages are stable on disk. Thus places that want to flush current pagecache content need to jump through hoops to make sure journalled data is not lost. This is manageable in cases completely controlled by ext4 (such as extent shifting operations or inode eviction) but it gets ugly for stuff like fsverity. Furthermore it is rather error prone as people often do not realize journalled data needs special handling. So change ext4_writepages() to commit transaction with inode's data before going through the writeback loop in WB_SYNC_ALL mode. As a result filemap_write_and_wait() is now really getting pages to stable storage and makes pagecache pages safe to reclaim. Consequently we can remove the special handling of journalled data from several places in follow up patches. Note that this will make fsync(2) for journalled data more expensive as we will end up not only committing the transaction we need but also checkpointing the data (which we may have previously skipped if the data was part of the running transaction). If we really cared, we would need to introduce special VFS function for writing out & invalidating page cache for a range, use ->launder_page callback to perform checkpointing, and use it from all the places that need this functionality. But at this point I'm not convinced the complexity is worth it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
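Why committing first makes a WB_SYNC_ALL pass sufficient can be illustrated with a toy state machine. This is a deliberately simplified userspace model with hypothetical names: the three page states and both functions are assumptions made for illustration and do not mirror jbd2's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_page_state {
    MODEL_CLEAN,      /* contents are at their final location */
    MODEL_COMMITTED,  /* in a committed transaction, awaiting checkpoint */
    MODEL_RUNNING     /* part of the still-running transaction */
};

/* Commit: everything in the running transaction becomes committed. */
static void model_commit(enum model_page_state *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pages[i] == MODEL_RUNNING)
            pages[i] = MODEL_COMMITTED;
}

/*
 * Writeback loop: checkpoints committed pages. Pages that belong to the
 * running transaction cannot be written in place yet. Returns the number
 * of pages that are still not stable after the pass.
 */
static size_t model_writepages(enum model_page_state *pages, size_t n,
                               bool sync_all)
{
    size_t unstable = 0;

    if (sync_all)
        model_commit(pages, n);  /* the step this commit adds */
    for (size_t i = 0; i < n; i++) {
        if (pages[i] == MODEL_COMMITTED)
            pages[i] = MODEL_CLEAN;
        if (pages[i] != MODEL_CLEAN)
            unstable++;
    }
    return unstable;
}
```

In the model, a pass without the commit leaves pages of the running transaction unstable, which is the situation that previously forced callers to add special handling.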
2023-04-14 | ext4: Clear dirty bit from pages without data to write | Jan Kara
With journalled data it can happen that checkpointing code will write out page contents without clearing the page dirty bit. The logic in ext4_page_nomap_can_writeout() then results in us never calling mpage_submit_page() and thus never clearing the dirty bit. Drop the optimization with ext4_page_nomap_can_writeout() and just always call mpage_submit_page(). ext4_bio_write_page() knows when to redirty the page, and the additional clearing & setting of page dirty bit for ordered mode writeout is not expensive enough to justify jumping through hoops to avoid it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14 | ext4: Keep pages with journalled data dirty | Jan Kara
Currently we clear page dirty bit when we checkpoint some buffers from a page with journalled data or when we perform delayed dirtying of a page in ext4_writepages(). In a quest to simplify handling of journalled data we want to keep page dirty as long as it has either buffers to checkpoint or journalled dirty data. So make sure to keep page dirty in ext4_writepages() if it still has journalled data attached to it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14 | ext4: Mark pages with journalled data dirty | Jan Kara
Currently pages with journalled data written by write(2) or modified by block zeroing during truncate(2) are not marked as dirty. They are dirtied only once the transaction commits. This however makes writeback code think inode has no pages to write and so ext4_writepages() is not called to make pages with journalled data persistent. Mark pages with journalled data dirty (similarly as it happens for writes through mmap) so that writeback code knows about them and ext4_writepages() can do what it needs to do to the inode. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Use a folio in ext4_page_mkwrite() | Matthew Wilcox
Convert to the folio API, saving a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-26-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_block_write_begin() to take a folio | Matthew Wilcox
All the callers now have a folio, so pass that in and operate on folios. Removes four calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230324180129.1220691-25-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_mpage_readpages() to work on folios | Matthew Wilcox
This definitely doesn't include support for large folios; there are all kinds of assumptions about the number of buffers attached to a folio. But it does remove several calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-24-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Use a folio in ext4_da_write_begin() | Matthew Wilcox
Remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-23-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_page_nomap_can_writeout to ext4_folio_nomap_can_writeout | Matthew Wilcox
Its one caller already uses a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-22-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert __ext4_block_zero_page_range() to use a folio | Matthew Wilcox
Use folio APIs throughout. Saves many calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230324180129.1220691-21-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_journalled_zero_new_buffers() to use a folio | Matthew Wilcox
Remove a call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-20-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Use a folio in ext4_journalled_write_end() | Matthew Wilcox
Convert the incoming page to a folio to remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-19-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_write_end() to use a folio | Matthew Wilcox
Convert the incoming struct page to a folio. Replaces two implicit calls to compound_head() with one explicit call. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-18-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_write_begin() to use a folio | Matthew Wilcox
Remove a lot of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-17-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_readpage_inline() to take a folio | Matthew Wilcox
Use the folio API in this function, saves a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-10-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert ext4_bio_write_page() to ext4_bio_write_folio() | Matthew Wilcox
The only caller now has a folio so pass it in directly and avoid the call to page_folio() at the beginning. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-9-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert mpage_page_done() to mpage_folio_done() | Matthew Wilcox
All callers now have a folio so we can pass one in and use the folio APIs to support large folios as well as save instructions by eliminating a call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-8-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Convert mpage_submit_page() to mpage_submit_folio() | Matthew Wilcox
All callers now have a folio so we can pass one in and use the folio APIs to support large folios as well as save instructions by eliminating calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-7-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 | ext4: Turn mpage_process_page() into mpage_process_folio() | Matthew Wilcox
The page/folio is only used to extract the buffers, so this is a simple change. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-6-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-05 | mm: return an ERR_PTR from __filemap_get_folio | Christoph Hellwig
Instead of returning NULL for all errors, distinguish between: - no entry found and not asked to allocate (-ENOENT) - failed to allocate memory (-ENOMEM) - would block (-EAGAIN) so that callers don't have to guess the error based on the passed in flags. Also pass the error through the direct callers: filemap_get_folio, filemap_lock_folio, filemap_grab_folio and filemap_get_incore_folio. [hch@lst.de: fix null-pointer deref] Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004 Link: https://lkml.kernel.org/r/20230307143410.28031-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs2] Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
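The error-pointer convention this commit adopts can be imitated in userspace. The kernel's ERR_PTR(), PTR_ERR() and IS_ERR() helpers encode a negative errno in the top 4095 values of the pointer range; the `model_` helpers below copy that idea, while `model_get_folio()` and its three parameters are hypothetical and exist only to produce the three error cases listed in the message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *model_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static long model_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static bool model_is_err(const void *ptr)
{
    /* Error codes occupy the top MODEL_MAX_ERRNO values of the pointer range. */
    return (uintptr_t)ptr >= (uintptr_t)(-MODEL_MAX_ERRNO);
}

struct model_folio {
    int index;
};

/*
 * Hypothetical lookup: index 7 is cached, index 8 is cached but locked,
 * and creating a new entry always fails in this model.
 */
static struct model_folio *model_get_folio(int index, bool may_create,
                                           bool nowait)
{
    static struct model_folio cached = { 7 };
    static struct model_folio locked = { 8 };

    if (index == cached.index)
        return &cached;
    if (index == locked.index)
        return nowait ? model_err_ptr(-EAGAIN) : &locked;
    if (!may_create)
        return model_err_ptr(-ENOENT);  /* not found, not asked to allocate */
    return model_err_ptr(-ENOMEM);      /* allocation failed */
}
```

A caller tests the result with `model_is_err()` and reads the reason with `model_ptr_err()` instead of inferring it from the flags it passed in. A plain NULL check is no longer the right test for such a function.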
2023-03-23 | ext4: fix comment: "start start" -> "start" in mpage_prepare_extent_to_map() | Theodore Ts'o
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-23 | ext4: Fix warnings when freezing filesystem with journaled data | Jan Kara
Test generic/390 in data=journal mode often triggers a warning that ext4_do_writepages() tries to start a transaction on frozen filesystem. This happens because although all dirty data is properly written, jbd2 checkpointing code writes data through submit_bh() and as a result only buffer dirty bits are cleared but page dirty bits stay set. Later when the filesystem is frozen, writeback code comes, tries to write supposedly dirty pages and the warning triggers. Fix the problem by calling sync_filesystem() once more after flushing the whole journal to clear stray page dirty bits. [ Applied fixup patches to address crashes when running data=journal tests; see links for more details -- TYT ] Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230308142528.12384-1-jack@suse.cz Reported-by: Eric Biggers <ebiggers@kernel.org> Link: https://lore.kernel.org/all/20230319183617.GA896@sol.localdomain Link: https://lore.kernel.org/r/20230323145404.21381-1-jack@suse.cz Link: https://lore.kernel.org/r/20230323145404.21381-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-23 | ext4: Convert data=journal writeback to use ext4_writepages() | Jan Kara
Add support for writeback of journalled data directly into ext4_writepages() instead of offloading it to write_cache_pages(). This actually significantly simplifies the code and reduces code duplication. For checkpointing of committed data we can use ext4_writepages() right away the same way as writeback of ordered data uses it on transaction commit. For journalling of dirty mapped pages, we need to add a special case to mpage_prepare_extent_to_map() to add all page buffers to the journal. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-8-tytso@mit.edu
2023-03-23 | ext4: Move mpage_page_done() calls after error handling | Jan Kara
In case mpage_submit_page() returns error, it doesn't really matter whether we call mpage_page_done() and then return error or whether we return directly because in that case page cleanup will be done by mpage_release_unused_pages() instead. Logically, it makes more sense to leave the cleanup to mpage_release_unused_pages() because we didn't succeed in writing the page. So move mpage_page_done() calls after the error handling. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-7-tytso@mit.edu
2023-03-23 | ext4: Move page unlocking out of mpage_submit_page() | Jan Kara
Move page unlocking during page writeback out of mpage_submit_page() into the callers. This will allow writeback in data=journal mode to keep the page locked for a bit longer. Since page unlocking is tightly connected to the increment of mpd->first_page (as that determines cleanup of locked but unwritten pages), move page unlocking as well as mpd->first_page handling into a helper function. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-6-tytso@mit.edu
2023-03-23 | ext4: Don't unlock page in ext4_bio_write_page() | Jan Kara
Do not unlock the written page in ext4_bio_write_page(). Instead leave the page locked and unlock it in the callers. We'll need to keep the page locked for data=journal writeback for a bit longer. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-5-tytso@mit.edu
2023-03-23 | ext4: Mark page for delayed dirtying only if it is pinned | Jan Kara
In data=journal mode, page should be dirtied only when it has buffers for checkpoint or it is writeably mapped. In the first case, we don't need to do anything special. In the second case, page was already added to the journal by ext4_page_mkwrite() and since transaction commit writeprotects mapped pages again, page should be writeable (and thus dirtied) only while it is part of the running transaction. So nothing needs to be done either. The only special case is when someone pins the page and uses this pin for modifying page data. So recognize this special case and only then mark the page as having data that needs adding to the journal. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-4-tytso@mit.edu
2023-03-23 | ext4: Use nr_to_write directly in mpage_prepare_extent_to_map() | Jan Kara
When looking up extent of pages to map in mpage_prepare_extent_to_map() we count how many pages we still need to find in a copy of wbc->nr_to_write counter. With more complex page handling for data=journal mode, it will be easier to use wbc->nr_to_write directly so that we don't forget to carry over changes back to nr_to_write counter. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-3-tytso@mit.edu
2023-03-23 | ext4: Update stale comment about write constraints | Jan Kara
The comment above do_journal_get_write_access() is very stale. Most of it just does not refer to what the function does today or how jbd2 works. The bit about transaction handling during write(2) is still correct so just update the function names in that part and move the comment to a more appropriate place. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-2-tytso@mit.edu
2023-03-12 | Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds
Pull ext4 fixes from Ted Ts'o: "Bug fixes and regressions for ext4, the most serious of which is a potential deadlock during directory renames that was introduced during the merge window discovered by a combination of syzbot and lockdep"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: zero i_disksize when initializing the bootloader inode
  ext4: make sure fs error flag setted before clear journal error
  ext4: commit super block if fs record error when journal record without error
  ext4, jbd2: add an optimized bmap for the journal inode
  ext4: fix WARNING in ext4_update_inline_data
  ext4: move where set the MAY_INLINE_DATA flag is set
  ext4: Fix deadlock during directory rename
  ext4: Fix comment about the 64BIT feature
  docs: ext4: modify the group desc size to 64
  ext4: fix another off-by-one fsmap error on 1k block filesystems
  ext4: fix RENAME_WHITEOUT handling for inline directories
  ext4: make kobj_type structures constant
  ext4: fix cgroup writeback accounting with fs-layer encryption
2023-03-11 | ext4: move where set the MAY_INLINE_DATA flag is set | Ye Bin
The only caller of ext4_find_inline_data_nolock() that needs setting of EXT4_STATE_MAY_INLINE_DATA flag is ext4_iget_extra_inode(). In ext4_write_inline_data_end() we just need to update inode->i_inline_off. Since we are going to add one more caller that does not need to set EXT4_STATE_MAY_INLINE_DATA, just move setting of EXT4_STATE_MAY_INLINE_DATA out to ext4_iget_extra_inode(). Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230307015253.2232062-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-28 | Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds
Pull ext4 updates from Ted Ts'o: "Improve performance for ext4 by allowing multiple process to perform direct I/O writes to preallocated blocks by using a shared inode lock instead of taking an exclusive lock. In addition, multiple bug fixes and cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix incorrect options show of original mount_opt and extend mount_opt2
  ext4: Fix possible corruption when moving a directory
  ext4: init error handle resource before init group descriptors
  ext4: fix task hung in ext4_xattr_delete_inode
  jbd2: fix data missing when reusing bh which is ready to be checkpointed
  ext4: update s_journal_inum if it changes after journal replay
  ext4: fail ext4_iget if special inode unallocated
  ext4: fix function prototype mismatch for ext4_feat_ktype
  ext4: remove unnecessary variable initialization
  ext4: fix inode tree inconsistency caused by ENOMEM
  ext4: refuse to create ea block when umounted
  ext4: optimize ea_inode block expansion
  ext4: remove dead code in updating backup sb
  ext4: dio take shared inode lock when overwriting preallocated blocks
  ext4: don't show commit interval if it is zero
  ext4: use ext4_fc_tl_mem in fast-commit replay path
  ext4: improve xattr consistency checking and error reporting
2023-02-23Merge tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds
Pull non-MM updates from Andrew Morton: "There is no particular theme here - mainly quick hits all over the tree. Most notable is a set of zlib changes from Mikhail Zaslonko which enhances and fixes zlib's use of S390 hardware support: 'lib/zlib: Set of s390 DFLTCC related patches for kernel zlib'"

* tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (55 commits)
  Update CREDITS file entry for Jesper Juhl
  sparc: allow PM configs for sparc32 COMPILE_TEST
  hung_task: print message when hung_task_warnings gets down to zero.
  arch/Kconfig: fix indentation
  scripts/tags.sh: fix the Kconfig tags generation when using latest ctags
  nilfs2: prevent WARNING in nilfs_dat_commit_end()
  lib/zlib: remove redundation assignement of avail_in dfltcc_gdht()
  lib/Kconfig.debug: do not enable DEBUG_PREEMPT by default
  lib/zlib: DFLTCC always switch to software inflate for Z_PACKET_FLUSH option
  lib/zlib: DFLTCC support inflate with small window
  lib/zlib: Split deflate and inflate states for DFLTCC
  lib/zlib: DFLTCC not writing header bits when avail_out == 0
  lib/zlib: fix DFLTCC ignoring flush modes when avail_in == 0
  lib/zlib: fix DFLTCC not flushing EOBS when creating raw streams
  lib/zlib: implement switching between DFLTCC and software
  lib/zlib: adjust offset calculation for dfltcc_state
  nilfs2: replace WARN_ONs for invalid DAT metadata block requests
  scripts/spelling.txt: add "exsits" pattern and fix typo instances
  fs: gracefully handle ->get_block not mapping bh in __mpage_writepage
  cramfs: Kconfig: fix spelling & punctuation
  ...
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in this series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-20Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds
Pull fsverity updates from Eric Biggers: "Fix the longstanding implementation limitation that fsverity was only supported when the Merkle tree block size, filesystem block size, and PAGE_SIZE were all equal. Specifically, add support for Merkle tree block sizes less than PAGE_SIZE, and make ext4 support fsverity on filesystems where the filesystem block size is less than PAGE_SIZE.

Effectively, this means that fsverity can now be used on systems with non-4K pages, at least on ext4. These changes have been tested using the verity group of xfstests, newly updated to cover the new code paths.

Also update fs/verity/ to support verifying data from large folios. There's also a similar patch for fs/crypto/, to support decrypting data from large folios, which I'm including in here to avoid a merge conflict between the fscrypt and fsverity branches"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fscrypt: support decrypting data from large folios
  fsverity: support verifying data from large folios
  fsverity.rst: update git repo URL for fsverity-utils
  ext4: allow verity with fs block size < PAGE_SIZE
  fs/buffer.c: support fsverity in block_read_full_folio()
  f2fs: simplify f2fs_readpage_limit()
  ext4: simplify ext4_readpage_limit()
  fsverity: support enabling with tree block size < PAGE_SIZE
  fsverity: support verification with tree block size < PAGE_SIZE
  fsverity: replace fsverity_hash_page() with fsverity_hash_block()
  fsverity: use EFBIG for file too large to enable verity
  fsverity: store log2(digest_size) precomputed
  fsverity: simplify Merkle tree readahead size calculation
  fsverity: use unsigned long for level_start
  fsverity: remove debug messages and CONFIG_FS_VERITY_DEBUG
  fsverity: pass pos and size to ->write_merkle_tree_block
  fsverity: optimize fsverity_cleanup_inode() on non-verity files
  fsverity: optimize fsverity_prepare_setattr() on non-verity files
  fsverity: optimize fsverity_file_open() on non-verity files
2023-02-19ext4: fail ext4_iget if special inode unallocatedBaokun Li
In ext4_fill_super(), EXT4_ORPHAN_FS flag is cleared after ext4_orphan_cleanup() is executed. Therefore, when __ext4_iget() is called to get an inode whose i_nlink is 0 when the flag exists, no error is returned. If the inode is a special inode, a null pointer dereference may occur. If the value of i_nlink is 0 for any inodes (except boot loader inodes) got by using the EXT4_IGET_SPECIAL flag, the current file system is corrupted. Therefore, make the ext4_iget() function return an error if it gets such an abnormal special inode.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199179
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216541
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216539
Reported-by: Luís Henriques <lhenriques@suse.de>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230107032126.4165860-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
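The decision described in this commit message can be modelled in user space roughly as follows. This is only a sketch, not the kernel code: the `demo_` names, the flag value, the result enum, and the boot loader inode number used here are assumptions made for illustration, and the handling of non-special inodes (reporting them as stale unless orphan cleanup is in progress) is an assumed simplification of the behaviour that existed before the change.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real definitions live in the ext4 headers. */
#define DEMO_IGET_SPECIAL     0x1u
#define DEMO_BOOT_LOADER_INO  5ul   /* assumed value, for illustration only */

enum demo_iget_result {
    DEMO_OK,         /* inode may be used */
    DEMO_STALE,      /* ordinary inode that is deleted or unallocated */
    DEMO_CORRUPTED   /* special inode with i_nlink == 0: fs is corrupted */
};

/*
 * Model of the link-count check: a special inode with a zero link count is
 * rejected even while the orphan-cleanup flag is set, which is the case the
 * commit message says used to slip through without an error.
 */
static enum demo_iget_result
demo_check_nlink(unsigned long ino, unsigned int nlink, unsigned int mode,
                 unsigned int flags, bool orphan_cleanup_running)
{
    if (nlink != 0)
        return DEMO_OK;
    if (ino == DEMO_BOOT_LOADER_INO)
        return DEMO_OK;              /* the one exception named in the message */
    if (flags & DEMO_IGET_SPECIAL)
        return DEMO_CORRUPTED;       /* no longer masked by orphan cleanup */
    if (mode == 0 || !orphan_cleanup_running)
        return DEMO_STALE;
    return DEMO_OK;                  /* orphan inode being processed */
}
```

With this model, a call such as `demo_check_nlink(8, 0, 0100000, DEMO_IGET_SPECIAL, true)` is rejected as corrupted, whereas the same inode requested without the special flag during orphan cleanup is still accepted.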
2023-02-19ext4: remove unnecessary variable initializationXU pengfei
Variables are assigned first and then used. Initialization is not required.

Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Link: https://lore.kernel.org/r/20230104055229.3663-1-xupengfei@nfschina.com
2023-02-02fs/ext4: use try_cmpxchg in ext4_update_bh_stateUros Bizjak
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in ext4_update_bh_state. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop.

No functional change intended.

Link: https://lkml.kernel.org/r/20221102071147.6642-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02fs: convert writepage_t callback to pass a folioMatthew Wilcox (Oracle)
Patch series "Convert writepage_t to use a folio".

More folioisation. I split out the mpage work from everything else because it completely dominated the patch, but some implementations I just converted outright.

This patch (of 2):

We always write back an entire folio, but that's currently passed as the head page. Convert all filesystems that use write_cache_pages() to expect a folio instead of a page.

Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>