summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-05-04eventfs: Free all of the eventfs_inode after RCUSteven Rostedt (Google)
The freeing of eventfs_inode via a kfree_rcu() callback. But the content of the eventfs_inode was being freed after the last kref. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop. Instead of using kfree_rcu() use call_rcu() that calls a function to do all the freeing of the eventfs_inode after a RCU grace period has expired. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.370261163@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs/tracing: Add callback for release of an eventfs_inodeSteven Rostedt (Google)
Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows for the creating on the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while a eventfs file that references it is being opened. Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com> Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-03gfs2: Convert gfs2_aspace_writepage() to use a folioMatthew Wilcox (Oracle)
Convert the incoming struct page to a folio and use it throughout. Saves six calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-05-03gfs2: Add a migrate_folio operation for journalled filesMatthew Wilcox (Oracle)
For journalled data, folio migration currently works by writing the folio back, freeing the folio and faulting the new folio back in. We can bypass that by telling the migration code to migrate the buffer_heads attached to our folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-05-03fsverity: use register_sysctl_init() to avoid kmemleak warningEric Biggers
Since the fsverity sysctl registration runs as a builtin initcall, there is no corresponding sysctl deregistration and the resulting struct ctl_table_header is not used. This can cause a kmemleak warning just after the system boots up. (A pointer to the ctl_table_header is stored in the fsverity_sysctl_header static variable, which kmemleak should detect; however, the compiler can optimize out that variable.) Avoid the kmemleak warning by using register_sysctl_init() which is intended for use by builtin initcalls and uses kmemleak_not_leak(). Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/r/CAHj4cs8DTSvR698UE040rs_pX1k-WVe7aR6N2OoXXuhXJPDC-w@mail.gmail.com Cc: stable@vger.kernel.org Reviewed-by: Joel Granados <j.granados@samsung.com> Link: https://lore.kernel.org/r/20240501025331.594183-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-05-03ext2: Remove LEGACY_DIRECT_IO dependencyRitesh Harjani (IBM)
commit fb5de4358e1a ("ext2: Move direct-io to use iomap"), converted ext2 direct-io to iomap which killed the call to blockdev_direct_IO(). So let's remove LEGACY_DIRECT_IO config dependency from ext2 Kconfig. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <f3303addc0b5cd7e5760beb2374b7e538a49d898.1714727887.git.ritesh.list@gmail.com>
2024-05-03nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inodeAl Viro
I suspect that inode_attach_wb() use is rather unidiomatic, but that's a separate story - in any case, its use is a few times per mount *and* the route by which we access that inode is "the host of address_space a page belongs to". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03gfs2: more obvious initializations of mapping->hostAl Viro
what's going on is copying the ->host of bdev's address_space Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-4-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mappingAl Viro
both for ->i_blkbits and both want the address_space in question anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03grow_dev_folio(): we only want ->bd_inode->i_mapping thereAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-3-viro@zeniv.linux.org.uk Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03use ->bd_mapping instead of ->bd_inode->i_mappingAl Viro
Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03Merge branch 'misc.erofs' into work.bdevAl Viro
2024-05-03bcachefs: remove dead function bdev_sectors()Yu Kuai
bdev_sectors() is not used hence remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-10-viro@zeniv.linux.org.uk Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03ext4: remove block_device_ejected()Yu Kuai
block_device_ejected() is added by commit bdfe0cbd746a ("Revert "ext4: remove block_device_ejected"") in 2015. At that time 'bdi->wb' is destroyed synchronized from del_gendisk(), hence if ext4 is still mounted, and then mark_buffer_dirty() will reference destroyed 'wb'. However, such problem doesn't exist anymore: - commit d03f6cdc1fc4 ("block: Dynamically allocate and refcount backing_dev_info") switch bdi to use refcounting; - commit 13eec2363ef0 ("fs: Get proper reference for s_bdi"), will grab additional reference of bdi while mounting, so that 'bdi->wb' will not be destroyed until generic_shutdown_super(). Hence remove this dead function block_device_ejected(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-7-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03xfs: simplify iext overflow checking and upgradeChristoph Hellwig
Currently the calls to xfs_iext_count_may_overflow and xfs_iext_count_upgrade are always paired. Merge them into a single function to simplify the callers and the actual check and upgrade logic itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extentChristoph Hellwig
Accessing if_bytes without the ilock is racy. Remove the initial if_bytes == 0 check in xfs_reflink_end_cow_extent and let ext_iext_lookup_extent fail for this case after we've taken the ilock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03xfs: upgrade the extent counters in xfs_reflink_end_cow_extent laterChristoph Hellwig
Defer the extent counter size upgrade until we know we're going to modify the extent mapping. This also defers dirtying the transaction and will allow us safely back out later in the function in later changes. Fixes: 4f86bb4b66c9 ("xfs: Conditionally upgrade existing inodes to use large extent counters") Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03xfs: xfs_quota_unreserve_blkres can't failChristoph Hellwig
Unreserving quotas can't fail due to quota limits, and we'll notice a shut down file system a bit later in all the callers anyway. Return void and remove the error checking and propagation in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03xfs: consolidate the xfs_quota_reserve_blkres definitionsChristoph Hellwig
xfs_trans_reserve_quota_nblks is already stubbed out if quota support is disabled, no need for an extra xfs_quota_reserve_blkres stub. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03xfs: clean up buffer allocation in xlog_do_recovery_passChristoph Hellwig
Merge the initial xlog_alloc_buffer calls, and pass the variable designating the length that is initialized to 1 above instead of passing the open coded 1 directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03xfs: fix log recovery buffer allocation for the legacy h_size fixupChristoph Hellwig
Commit a70f9fe52daa ("xfs: detect and handle invalid iclog size set by mkfs") added a fixup for incorrect h_size values used for the initial umount record in old xfsprogs versions. Later commit 0c771b99d6c9 ("xfs: clean up calculation of LR header blocks") cleaned up the log reover buffer calculation, but stoped using the fixed up h_size value to size the log recovery buffer, which can lead to an out of bounds access when the incorrect h_size does not come from the old mkfs tool, but a fuzzer. Fix this by open coding xlog_logrec_hblks and taking the fixed h_size into account for this calculation. Fixes: 0c771b99d6c9 ("xfs: clean up calculation of LR header blocks") Reported-by: Sam Sun <samsun1006219@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03ext4: open coding repeated check in next_linear_groupKemeng Shi
Open coding repeated check in next_linear_group. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03ext4: use correct criteria name instead stale integer number in commentKemeng Shi
Use correct criteria name instead stale integer number in comment Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunkKemeng Shi
In mb_mark_used, we will find free chunk and mark it inuse. For chunk in mid of passed range, we could simply mark whole chunk inuse. For chunk at end of range, we may need to mark a continuous bits at end of part of chunk inuse and keep rest part of chunk free. To only mark a part of chunk inuse, we firstly mark whole chunk inuse and then mark a continuous range at end of chunk free. Function mb_mark_used does several times of "mb_find_buddy; mb_clear_bit; ..." to mark a continuous range free which can be done by simply calling ext4_mb_mark_free_simple which free continuous bits in a more effective way. Just call ext4_mb_mark_free_simple in mb_mark_used to use existing and effective code to free continuous blocks in chunk at end of passed range. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_usedKemeng Shi
Add test_mb_mark_used_cost to estimate cost of mb_mark_used Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240424061904.987525-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03ext4: keep "prefetch_grp" and "nr" consistentKemeng Shi
Keep "prefetch_grp" and "nr" consistent to avoid to call ext4_mb_prefetch_fini with non-prefetched groups. When we step into next criteria, "prefetch_grp" is set to prefetch start of new criteria while "nr" is number of the prefetched group in previous criteria. If previous criteria and next criteria are both inexpensive (< CR_GOAL_LEN_SLOW) and prefetch_ios reachs sbi->s_mb_prefetch_limit in previous criteria, "prefetch_grp" and "nr" will be inconsistent and may introduce unexpected cost to do ext4_mb_init_group for non-prefetched groups. Reset "nr" to 0 when we reset "prefetch_grp" to goal group to keep them consistent. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03ext4: implement filesystem specific alloc_inode in unit testKemeng Shi
We expect inode with ext4_info_info type as following: mbt_kunit_init mbt_mb_init ext4_mb_init ext4_mb_init_backend sbi->s_buddy_cache = new_inode(sb); EXT4_I(sbi->s_buddy_cache)->i_disksize = 0; Implement alloc_inode ionde with ext4_inode_info type to avoid out-of-bounds write. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322165518.8147-1-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03ext4: do not create EA inode under buffer lockJan Kara
ext4_xattr_set_entry() creates new EA inodes while holding buffer lock on the external xattr block. This is problematic as it nests all the allocation locking (which acquires locks on other buffers) under the buffer lock. This can even deadlock when the filesystem is corrupted and e.g. quota file is setup to contain xattr block as data block. Move the allocation of EA inode out of ext4_xattr_set_entry() into the callers. Reported-by: syzbot+a43d4f48b8397d0e41a9@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03Revert "ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()"Jan Kara
This reverts commit 7f48212678e91a057259b3e281701f7feb1ee397. We will need the special cleanup handling once we move allocation of EA inode outside of the buffer lock in the following patch. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: replace deprecated strncpy with alternativesJustin Stitt
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. in file.c: s_last_mounted is marked as __nonstring meaning it does not need to be NUL-terminated. Let's instead use strtomem_pad() to copy bytes from the string source to the byte array destination -- while also ensuring to pad with zeroes. in ioctl.c: We can drop the memset and size argument in favor of using the new 2-argument version of strscpy_pad() -- which was introduced with Commit e6584c3964f2f ("string: Allow 2-argument strscpy()"). This guarantees NUL-termination and NUL-padding on the destination buffer -- which seems to be a requirement judging from this comment: | static int ext4_ioctl_getlabel(struct ext4_sb_info *sbi, char __user *user_label) | { | char label[EXT4_LABEL_MAX + 1]; | | /* | * EXT4_LABEL_MAX must always be smaller than FSLABEL_MAX because | * FSLABEL_MAX must include terminating null byte, while s_volume_name | * does not have to. | */ in super.c: s_first_error_func is marked as __nonstring meaning we can take the same approach as in file.c; just use strtomem_pad() Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240321-strncpy-fs-ext4-file-c-v1-1-36a6a09fef0c@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: clean up s_mb_rb_lock to fix build warnings with C=1Baokun Li
Running sparse (make C=1) on mballoc.c we get the following warning: fs/ext4/mballoc.c:3194:13: warning: context imbalance in 'ext4_mb_seq_structs_summary_start' - wrong count at exit This is because __acquires(&EXT4_SB(sb)->s_mb_rb_lock) was called in ext4_mb_seq_structs_summary_start(), but s_mb_rb_lock was removed in commit 83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree"), so remove the __acquires to silence the warning. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: set the type of max_zeroout to unsigned int to avoid overflowBaokun Li
The max_zeroout is of type int and the s_extent_max_zeroout_kb is of type uint, and the s_extent_max_zeroout_kb can be freely modified via the sysfs interface. When the block size is 1024, max_zeroout may overflow, so declare it as unsigned int to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: set type of ac_groups_linear_remaining to __u32 to avoid overflowBaokun Li
Now ac_groups_linear_remaining is of type __u16 and s_mb_max_linear_groups is of type unsigned int, so an overflow occurs when setting a value above 65535 through the mb_max_linear_groups sysfs interface. Therefore, the type of ac_groups_linear_remaining is set to __u32 to avoid overflow. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: add positive int attr pointer to avoid sysfs variables overflowBaokun Li
The following variables controlled by the sysfs interface are of type int and are normally used in the range [0, INT_MAX], but are declared as attr_pointer_ui, and thus may be set to values that exceed INT_MAX and result in overflows to get negative values. err_ratelimit_burst msg_ratelimit_burst warning_ratelimit_burst err_ratelimit_interval_ms msg_ratelimit_interval_ms warning_ratelimit_interval_ms Therefore, we add attr_pointer_pi (aka positive int attr pointer) with a value range of 0-INT_MAX to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: add new attr pointer attr_mb_orderBaokun Li
The s_mb_best_avail_max_trim_order is of type unsigned int, and has a range of values well beyond the normal use of the mb_order. Although the mballoc code is careful enough that large numbers don't matter there, but this can mislead the sysadmin into thinking that it's normal to set such values. Hence add a new attr_id attr_mb_order with values in the range [0, 64] to avoid storing garbage values and make us more resilient to surprises in the future. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()Baokun Li
We can trigger a slab-out-of-bounds with the following commands: mkfs.ext4 -F /dev/$disk 10G mount /dev/$disk /tmp/test echo 2147483647 > /sys/fs/ext4/$disk/mb_group_prealloc echo test > /tmp/test/file && sync ================================================================== BUG: KASAN: slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4] Read of size 8 at addr ffff888121b9d0f0 by task kworker/u2:0/11 CPU: 0 PID: 11 Comm: kworker/u2:0 Tainted: GL 6.7.0-next-20240118 #521 Call Trace: dump_stack_lvl+0x2c/0x50 kasan_report+0xb6/0xf0 ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4] ext4_mb_regular_allocator+0x19e9/0x2370 [ext4] ext4_mb_new_blocks+0x88a/0x1370 [ext4] ext4_ext_map_blocks+0x14f7/0x2390 [ext4] ext4_map_blocks+0x569/0xea0 [ext4] ext4_do_writepages+0x10f6/0x1bc0 [ext4] [...] ================================================================== The flow of issue triggering is as follows: // Set s_mb_group_prealloc to 2147483647 via sysfs ext4_mb_new_blocks ext4_mb_normalize_request ext4_mb_normalize_group_request ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc ext4_mb_regular_allocator ext4_mb_choose_next_group ext4_mb_choose_next_group_best_avail mb_avg_fragment_size_order order = fls(len) - 2 = 29 ext4_mb_find_good_group_avg_frag_lists frag_list = &sbi->s_mb_avg_fragment_size[order] if (list_empty(frag_list)) // Trigger SOOB! At 4k block size, the length of the s_mb_avg_fragment_size list is 14, but an oversized s_mb_group_prealloc is set, causing slab-out-of-bounds to be triggered by an attempt to access an element at index 29. Add a new attr_id attr_clusters_in_group with values in the range [0, sbi->s_clusters_per_group] and declare mb_group_prealloc as that type to fix the issue. In addition avoid returning an order from mb_avg_fragment_size_order() greater than MB_NUM_ORDERS(sb) and reduce some useless loops. Fixes: 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240319113325.3110393-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: refactor out ext4_generic_attr_show()Baokun Li
Refactor out the function ext4_generic_attr_show() to handle the reading of values of various common types, with no functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: refactor out ext4_generic_attr_store()Baokun Li
Refactor out the function ext4_generic_attr_store() to handle the setting of values of various common types, with no functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: avoid overflow when setting values via sysfsBaokun Li
When setting values of type unsigned int through sysfs, we use kstrtoul() to parse it and then truncate part of it as the final set value, when the set value is greater than UINT_MAX, the set value will not match what we see because of the truncation. As follows: $ echo 4294967296 > /sys/fs/ext4/sda/mb_max_linear_groups $ cat /sys/fs/ext4/sda/mb_max_linear_groups 0 So we use kstrtouint() to parse the attr_pointer_ui type to avoid the inconsistency described above. In addition, a judgment is added to avoid setting s_resv_clusters less than 0. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02Revert "ext4: apply umask if ACL support is disabled"Max Kellermann
This reverts commit 484fd6c1de13b336806a967908a927cc0356e312. The commit caused a regression because now the umask was applied to symlinks and the fix is unnecessary because the umask/O_TMPFILE bug has been fixed somewhere else already. Fixes: https://lore.kernel.org/lkml/28DSITL9912E1.2LSZUVTGTO52Q@mforney.org/ Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Michael Forney <mforney@mforney.org> Link: https://lore.kernel.org/r/20240315142956.2420360-1-max.kellermann@ionos.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02set_blocksize(): switch to passing struct file *Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opensAl Viro
btrfs_get_bdev_and_sb() has two callers - btrfs_open_one_device(), which asks for open to be exclusive and btrfs_get_dev_args_from_path(), which doesn't. Currently it does set_blocksize() in all cases. I'm rather dubious about the need to do set_blocksize() anywhere in btrfs, to be honest - there's some access to page cache of underlying block devices in there, but it's nowhere near the hot paths, AFAICT. In any case, btrfs_get_dev_args_from_path() only needs to read the on-disk superblock and copy several fields out of it; all callers are only interested in devices that are already opened and brought into per-filesystem set, so setting the block size is redundant for those and actively harmful if we are given a pathname of unrelated device. So we only need btrfs_get_bdev_and_sb() to call set_blocksize() when it's asked to open exclusive. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02btrfs: make sure that WRITTEN is set on all metadata blocksJosef Bacik
We previously would call btrfs_check_leaf() if we had the check integrity code enabled, which meant that we could only run the extended leaf checks if we had WRITTEN set on the header flags. This leaves a gap in our checking, because we could end up with corruption on disk where WRITTEN isn't set on the leaf, and then the extended leaf checks don't get run which we rely on to validate all of the item pointers to make sure we don't access memory outside of the extent buffer. However, since 732fab95abe2 ("btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option") we no longer call btrfs_check_leaf() from btrfs_mark_buffer_dirty(), which means we only ever call it on blocks that are being written out, and thus have WRITTEN set, or that are being read in, which should have WRITTEN set. Add checks to make sure we have WRITTEN set appropriately, and then make sure __btrfs_check_leaf() always does the item checking. This will protect us from file systems that have been corrupted and no longer have WRITTEN set on some of the blocks. This was hit on a crafted image tweaking the WRITTEN bit and reported by KASAN as out-of-bound access in the eb accessors. The example is a dir item at the end of an eb. [2.042] BTRFS warning (device loop1): bad eb member start: ptr 0x3fff start 30572544 member offset 16410 size 2 [2.040] general protection fault, probably for non-canonical address 0xe0009d1000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI [2.537] KASAN: maybe wild-memory-access in range [0x0005088000000018-0x000508800000001f] [2.729] CPU: 0 PID: 2587 Comm: mount Not tainted 6.8.2 #1 [2.729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [2.621] RIP: 0010:btrfs_get_16+0x34b/0x6d0 [2.621] RSP: 0018:ffff88810871fab8 EFLAGS: 00000206 [2.621] RAX: 0000a11000000003 RBX: ffff888104ff8720 RCX: ffff88811b2288c0 [2.621] RDX: dffffc0000000000 RSI: ffffffff81dd8aca RDI: ffff88810871f748 [2.621] RBP: 000000000000401a R08: 0000000000000001 R09: ffffed10210e3ee9 [2.621] R10: ffff88810871f74f R11: 205d323430333737 R12: 000000000000001a [2.621] R13: 000508800000001a R14: 1ffff110210e3f5d R15: ffffffff850011e8 [2.621] FS: 00007f56ea275840(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000 [2.621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2.621] CR2: 00007febd13b75c0 CR3: 000000010bb50000 CR4: 00000000000006f0 [2.621] Call Trace: [2.621] <TASK> [2.621] ? show_regs+0x74/0x80 [2.621] ? die_addr+0x46/0xc0 [2.621] ? exc_general_protection+0x161/0x2a0 [2.621] ? asm_exc_general_protection+0x26/0x30 [2.621] ? btrfs_get_16+0x33a/0x6d0 [2.621] ? btrfs_get_16+0x34b/0x6d0 [2.621] ? btrfs_get_16+0x33a/0x6d0 [2.621] ? __pfx_btrfs_get_16+0x10/0x10 [2.621] ? __pfx_mutex_unlock+0x10/0x10 [2.621] btrfs_match_dir_item_name+0x101/0x1a0 [2.621] btrfs_lookup_dir_item+0x1f3/0x280 [2.621] ? __pfx_btrfs_lookup_dir_item+0x10/0x10 [2.621] btrfs_get_tree+0xd25/0x1910 Reported-by: lei lu <llfamsec@gmail.com> CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more details from report ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02btrfs: qgroup: do not check qgroup inherit if qgroup is disabledQu Wenruo
[BUG] After kernel commit 86211eea8ae1 ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter"), user space tool snapper will fail to create snapshot using its timeline feature. [CAUSE] It turns out that, if using timeline snapper would unconditionally pass btrfs_qgroup_inherit parameter (assigning the new snapshot to qgroup 1/0) for snapshot creation. In that case, since qgroup is disabled there would be no qgroup 1/0, and btrfs_qgroup_check_inherit() would return -ENOENT and fail the whole snapshot creation. [FIX] Just skip the check if qgroup is not enabled. This is to keep the older behavior for user space tools, as if the kernel behavior changed for user space, it is a regression of kernel. Thankfully snapper is also fixing the behavior by detecting if qgroup is running in the first place, so the effect should not be that huge. Link: https://github.com/openSUSE/snapper/issues/894 Fixes: 86211eea8ae1 ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02ext4: add support for FS_IOC_GETFSSYSFSPATHKent Overstreet
The new sysfs path ioctl lets us get the /sys/fs/ path for a given filesystem in a fs agnostic way, potentially nudging us towards standarizing some of our reporting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Link: https://lore.kernel.org/r/20240315035308.3563511-4-kent.overstreet@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02ext4: avoid excessive credit estimate in ext4_tmpfile()Jan Kara
A user with minimum journal size (1024 blocks these days) complained about the following error triggered by generic/697 test in ext4_tmpfile(): run fstests generic/697 at 2024-02-28 05:34:46 JBD2: vfstest wants too many credits credits:260 rsv_credits:0 max:256 EXT4-fs error (device loop0) in __ext4_new_inode:1083: error 28 Indeed the credit estimate in ext4_tmpfile() is huge. EXT4_MAXQUOTAS_INIT_BLOCKS() is 219, then 10 credits from ext4_tmpfile() itself and then ext4_xattr_credits_for_new_inode() adds more credits needed for security attributes and ACLs. Now the EXT4_MAXQUOTAS_INIT_BLOCKS() is in fact unnecessary because we've already initialized quotas with dquot_init() shortly before and so EXT4_MAXQUOTAS_TRANS_BLOCKS() is enough (which boils down to 3 credits). Fixes: af51a2ac36d1 ("ext4: ->tmpfile() support") Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Luis Henriques <lhenriques@suse.de> Tested-by: Disha Goel <disgoel@linux.ibm.com> Link: https://lore.kernel.org/r/20240307115320.28949-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: remove unneeded if checks before kfreeThorsten Blum
kfree already checks if its argument is NULL. This fixes two Coccinelle/coccicheck warnings reported by ifnullfree.cocci. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20240317153638.2136-2-thorsten.blum@toblux.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ovl: remove upper umask handling from ovl_create_upper()Miklos Szeredi
This is already done by vfs_prepare_mode() when creating the upper object by vfs_create(), vfs_mkdir() and vfs_mknod(). No regressions have been observed in xfstests run with posix acls turned off for the upper filesystem. Fixes: 1639a49ccdce ("fs: move S_ISGID stripping into the vfs_*() helpers") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-05-02ovl: implement tmpfileMiklos Szeredi
Combine inode creation with opening a file. There are six separate objects that are being set up: the backing inode, dentry and file, and the overlay inode, dentry and file. Cleanup in case of an error is a bit of a challenge and is difficult to test, so careful review is needed. All tmpfile testcases except generic/509 now run/pass, and no regressions are observed with full xfstests. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com>