linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2023-06-08	ceph: fix use-after-free bug for inodes when flushing capsnaps	Xiubo Li
	There is a race between capsnaps flush and removing the inode from 'mdsc->snap_flush_list' list: == Thread A == == Thread B == ceph_queue_cap_snap() -> allocate 'capsnapA' ->ihold('&ci->vfs_inode') ->add 'capsnapA' to 'ci->i_cap_snaps' ->add 'ci' to 'mdsc->snap_flush_list' ... == Thread C == ceph_flush_snaps() ->__ceph_flush_snaps() ->__send_flush_snap() handle_cap_flushsnap_ack() ->iput('&ci->vfs_inode') this also will release 'ci' ... == Thread D == ceph_handle_snap() ->flush_snaps() ->iterate 'mdsc->snap_flush_list' ->get the stale 'ci' ->remove 'ci' from ->ihold(&ci->vfs_inode) this 'mdsc->snap_flush_list' will WARNING To fix this we will increase the inode's i_count ref when adding 'ci' to the 'mdsc->snap_flush_list' list. [ idryomov: need_put int -> bool ] Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=2209299 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-07	fs: avoid empty option when generating legacy mount string	Thomas Weißschuh
	As each option string fragment is always prepended with a comma it would happen that the whole string always starts with a comma. This could be interpreted by filesystem drivers as an empty option and may produce errors. For example the NTFS driver from ntfs.ko behaves like this and fails when mounted via the new API. Link: https://github.com/util-linux/util-linux/issues/2298 Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Fixes: 3e1aeb00e6d1 ("vfs: Implement a filesystem superblock creation/configuration context") Cc: stable@vger.kernel.org Message-Id: <20230607-fs-empty-option-v1-1-20c8dbf4671b@weissschuh.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-07	afs: Fix setting of mtime when creating a file/dir/symlink	David Howells
	kafs incorrectly passes a zero mtime (ie. 1st Jan 1970) to the server when creating a file, dir or symlink because the mtime recorded in the afs_operation struct gets passed to the server by the marshalling routines, but the afs_mkdir(), afs_create() and afs_symlink() functions don't set it. This gets masked if a file or directory is subsequently modified. Fix this by filling in op->mtime before calling the create op. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-07	fs: Restrict lock_two_nondirectories() to non-directory inodes	Jan Kara
	Currently lock_two_nondirectories() is skipping any passed directories. After vfs_rename() uses lock_two_inodes(), all the remaining four users of this function pass only regular files to it. So drop the somewhat unusual "skip directory" logic and instead warn if anybody passes directory to it. This also allows us to use lock_two_inodes() in lock_two_nondirectories() to concentrate the lock ordering logic in less places. Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230601105830.13168-6-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-06	btrfs: properly enable async discard when switching from RO->RW	Chris Mason
	The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info to force discards off when the filesystem has aborted or we're generally not able to run discards. This gets flipped on when we're mounted rw, and also when we go from ro->rw. Commit 63a7cb13071842 ("btrfs: auto enable discard=async when possible") enabled async discard by default, and this meant "mount -o ro /dev/xxx /yyy" had async discards turned on. Unfortunately, this meant our check in btrfs_remount_cleanup() would see that discards are already on: /* If we toggled discard async / if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) && btrfs_test_opt(fs_info, DISCARD_ASYNC)) btrfs_discard_resume(fs_info); So, we'd never call btrfs_discard_resume() when remounting the root filesystem from ro->rw. drgn shows this really nicely: import os import sys from drgn.helpers.linux.fs import path_lookup from drgn import NULL, Object, Type, cast def btrfs_sb(sb): return cast("struct btrfs_fs_info ", sb.s_fs_info) if len(sys.argv) == 1: path = "/" else: path = sys.argv[1] fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info) BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING'] if fs_info.flags & BTRFS_FS_DISCARD_RUNNING: print("discard running flag is on") else: print("discard running flag is off") [root]# mount \| grep nvme /dev/nvme0n1p3 on / type btrfs (rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/) [root]# ./discard_running.drgn discard running flag is off [root]# mount -o remount,discard=sync / [root]# mount -o remount,discard=async / [root]# ./discard_running.drgn discard running flag is on The fix is to call btrfs_discard_resume() when we're going from ro->rw. It already checks to make sure the async discard flag is on, so it'll do the right thing. Fixes: 63a7cb13071842 ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06	gfs2: Don't remember delete unless it's successful	Bob Peterson
	This patch changes function evict_unlinked_inode so it does not call gfs2_inode_remember_delete until it gets a good return code from gfs2_dinode_dealloc. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06	gfs2: Update rl_unlinked before releasing rgrp lock	Bob Peterson
	Function gfs2_free_di was changing the rgrp lvb count of unlinked dinodes after the lock was released. This patch moves it inside the lock. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06	gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold	Bob Peterson
	This patch fixes a case in which function gfs2_quota_hold encounters an assert error and exits. The lack of gfs2_qa_put causes further problems when the inode is evicted and the get/put count is non-zero. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06	gfs2: ignore rindex_update failure in dinode_dealloc	Bob Peterson
	Before this patch, function gfs2_dinode_dealloc would abort if it got a bad return code from gfs2_rindex_update(). The problem is that it left the dinode in the unlinked (not free) state, which meant subsequent fsck would clean it up and flag an error. That meant some of our QE tests would fail. The sole purpose of gfs2_rindex_update(), in this code path, is to read in any newer rgrps added by gfs2_grow. But since this is a delete operation it won't actually use any of those new rgrps. It can really only twiddle the bits from "Unlinked" to "Free" in an existing rgrp. Therefore the error should not prevent the transition from unlinked to free. This patch makes gfs2_dinode_dealloc ignore the bad return code and proceed with freeing the dinode so the QE tests will not be tripped up. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06	gfs2: fix minor comment typos	Bob Peterson
	Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06	gfs2: simplify gdlm_put_lock with out_free label	Bob Peterson
	This patch introduces a new out_free label and consolidates the three places function gdlm_put_lock freed the glock. No change in functionality. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06	mm: Add support for unaccepted memory	Kirill A. Shutemov
	UEFI Specification version 2.9 introduces the concept of memory acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, require memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific to the Virtual Machine platform. There are several ways the kernel can deal with unaccepted memory: 1. Accept all the memory during boot. It is easy to implement and it doesn't have runtime cost once the system is booted. The downside is very long boot time. Accept can be parallelized to multiple CPUs to keep it manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate memory bandwidth and does not scale beyond the point. 2. Accept a block of memory on the first use. It requires more infrastructure and changes in page allocator to make it work, but it provides good boot time. On-demand memory accept means latency spikes every time kernel steps onto a new memory block. The spikes will go away once workload data set size gets stabilized or all memory gets accepted. 3. Accept all memory in background. Introduce a thread (or multiple) that gets memory accepted proactively. It will minimize time the system experience latency spikes on memory allocation while keeping low boot time. This approach cannot function on its own. It is an extension of #2: background memory acceptance requires functional scheduler, but the page allocator may need to tap into unaccepted memory before that. The downside of the approach is that these threads also steal CPU cycles and memory bandwidth from the user's workload and may hurt user experience. Implement #1 and #2 for now. #2 is the default. Some workloads may want to use #1 with accept_memory=eager in kernel command line. #3 can be implemented later based on user's demands. Support of unaccepted memory requires a few changes in core-mm code: - memblock accepts memory on allocation. It serves early boot memory allocations and doesn't limit them to pre-accepted pool of memory. - page allocator accepts memory on the first allocation of the page. When kernel runs out of accepted memory, it accepts memory until the high watermark is reached. It helps to minimize fragmentation. EFI code will provide two helpers if the platform supports unaccepted memory: - accept_memory() makes a range of physical addresses accepted. - range_contains_unaccepted_memory() checks anything within the range of physical addresses requires acceptance. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
2023-06-06	Merge tag 'gfs2-v6.4-rc4-fix' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: - Don't get stuck writing page onto itself under direct I/O * tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Don't get stuck writing page onto itself under direct I/O
2023-06-05	jbd2: Avoid printing outside the boundary of the buffer	Andy Shevchenko
	Theoretically possible that "%pg" will take all room for the j_devname and hence the "-%lu" will go outside the boundary due to unconditional sprintf() in use. To make this code more robust, replace two sequential s*printf():s by a single call and then replace forbidden character. It's possible to do this way, because '/' won't ever be in the result of "-%lu". Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605170553.7835-2-andriy.shevchenko@linux.intel.com
2023-06-05	btrfs: subpage: fix a crash in metadata repair path	Qu Wenruo
	[BUG] Test case btrfs/027 would crash with subpage (64K page size, 4K sectorsize) with the following dying messages: debug: map_length=16384 length=65536 type=metadata\|raid6(0x104) assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: btrfs_assertfail+0x28/0x2c [btrfs] btrfs_map_repair_block+0x150/0x2b8 [btrfs] btrfs_repair_io_failure+0xd4/0x31c [btrfs] btrfs_read_extent_buffer+0x150/0x16c [btrfs] read_tree_block+0x38/0xbc [btrfs] read_tree_root_path+0xfc/0x1bc [btrfs] btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs] open_ctree+0xa30/0x172c [btrfs] btrfs_mount_root+0x3c4/0x4a4 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec vfs_kern_mount.part.0+0x90/0xd4 vfs_kern_mount+0x14/0x28 btrfs_mount+0x114/0x418 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec path_mount+0x3e0/0xb64 __arm64_sys_mount+0x200/0x2d8 invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x60/0x11c do_el0_svc+0x38/0x98 el0_svc+0x40/0xa8 el0t_64_sync_handler+0xf4/0x120 el0t_64_sync+0x190/0x194 Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000) [CAUSE] In btrfs/027 we test RAID6 with missing devices, in this particular case, we're repairing a metadata at the end of a data stripe. But at btrfs_repair_io_failure(), we always pass a full PAGE for repair, and for subpage case this can cross stripe boundary and lead to the above BUG_ON(). This metadata repair code is always there, since the introduction of subpage support, but this can trigger BUG_ON() after the bio split ability at btrfs_map_bio(). [FIX] Instead of passing the old PAGE_SIZE, we calculate the correct length based on the eb size and page size for both regular and subpage cases. CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05	init: improve the name_to_dev_t interface	Christoph Hellwig
	name_to_dev_t has a very misleading name, that doesn't make clear it should only be used by the early init code, and also has a bad calling convention that doesn't allow returning different kinds of errors. Rename it to early_lookup_bdev to make the use case clear, and return an errno, where -EINVAL means the string could not be parsed, and -ENODEV means it the string was valid, but there was no device found for it. Also stub out the whole call for !CONFIG_BLOCK as all the non-block root cases are always covered in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	ext4: wire up the ->mark_dead holder operation for log devices	Christoph Hellwig
	Implement a set of holder_ops that shut down the file system when the block device used as log device is removed undeneath the file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	ext4: wire up sops->shutdown	Christoph Hellwig
	Wire up the shutdown method to shut down the file system when the underlying block device is marked dead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	ext4: split ext4_shutdown	Christoph Hellwig
	Split ext4_shutdown into a low-level helper that will be reused for implementing the shutdown super operation and a wrapper for the ioctl handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	xfs: wire up the ->mark_dead holder operation for log and RT devices	Christoph Hellwig
	Implement a set of holder_ops that shut down the file system when the block device used as log or RT device is removed undeneath the file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	xfs: wire up sops->shutdown	Christoph Hellwig
	Wire up the shutdown method to shut down the file system when the underlying block device is marked dead. Add a new message to clearly distinguish this shutdown reason from other shutdowns. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	fs: add a method to shut down the file system	Christoph Hellwig
	Add a new ->shutdown super operation that can be used to tell the file system to shut down, and call it from newly created holder ops when the block device under a file system shuts down. This only covers the main block device for "simple" file systems using get_tree_bdev / mount_bdev. File systems their own get_tree method or opening additional devices will need to set up their own blk_holder_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	block: introduce holder ops	Christoph Hellwig
	Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05	quota: fix warning in dqgrab()	Ye Bin
	There's issue as follows when do fault injection: WARNING: CPU: 1 PID: 14870 at include/linux/quotaops.h:51 dquot_disable+0x13b7/0x18c0 Modules linked in: CPU: 1 PID: 14870 Comm: fsconfig Not tainted 6.3.0-next-20230505-00006-g5107a9c821af-dirty #541 RIP: 0010:dquot_disable+0x13b7/0x18c0 RSP: 0018:ffffc9000acc79e0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88825e41b980 RDX: 0000000000000000 RSI: ffff88825e41b980 RDI: 0000000000000002 RBP: ffff888179f68000 R08: ffffffff82087ca7 R09: 0000000000000000 R10: 0000000000000001 R11: ffffed102f3ed026 R12: ffff888179f68130 R13: ffff888179f68110 R14: dffffc0000000000 R15: ffff888179f68118 FS: 00007f450a073740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffe96f2efd8 CR3: 000000025c8ad000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dquot_load_quota_sb+0xd53/0x1060 dquot_resume+0x172/0x230 ext4_reconfigure+0x1dc6/0x27b0 reconfigure_super+0x515/0xa90 __x64_sys_fsconfig+0xb19/0xd20 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Above issue may happens as follows: ProcessA ProcessB ProcessC sys_fsconfig vfs_fsconfig_locked reconfigure_super ext4_remount dquot_suspend -> suspend all type quota sys_fsconfig vfs_fsconfig_locked reconfigure_super ext4_remount dquot_resume ret = dquot_load_quota_sb add_dquot_ref do_open -> open file O_RDWR vfs_open do_dentry_open get_write_access atomic_inc_unless_negative(&inode->i_writecount) ext4_file_open dquot_file_open dquot_initialize __dquot_initialize dqget atomic_inc(&dquot->dq_count); __dquot_initialize __dquot_initialize dqget if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) ext4_acquire_dquot -> Return error DQ_ACTIVE_B flag isn't set dquot_disable invalidate_dquots if (atomic_read(&dquot->dq_count)) dqgrab WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) -> Trigger warning In the above scenario, 'dquot->dq_flags' has no DQ_ACTIVE_B is normal when dqgrab(). To solve above issue just replace the dqgrab() use in invalidate_dquots() with atomic_inc(&dquot->dq_count). Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230605140731.2427629-3-yebin10@huawei.com>
2023-06-05	quota: Properly disable quotas when add_dquot_ref() fails	Jan Kara
	When add_dquot_ref() fails (usually due to IO error or ENOMEM), we want to disable quotas we are trying to enable. However dquot_disable() call was passed just the flags we are enabling so in case flags == DQUOT_USAGE_ENABLED dquot_disable() call will just fail with EINVAL instead of properly disabling quotas. Fix the problem by always passing DQUOT_LIMITS_ENABLED \| DQUOT_USAGE_ENABLED to dquot_disable() in this case. Reported-and-tested-by: Ye Bin <yebin10@huawei.com> Reported-by: syzbot+e633c79ceaecbf479854@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230605140731.2427629-2-yebin10@huawei.com>
2023-06-05	NFSD: Ensure that xdr_write_pages updates rq_next_page	Chuck Lever
	All other NFSv[23] procedures manage to keep page_ptr and rq_next_page in lock step. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	NFSD: Replace encode_cinfo()	Chuck Lever
	De-duplicate "reserve_space; encode_cinfo". Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	NFSD: Add encoders for NFSv4 clientids and verifiers	Chuck Lever
	Deduplicate some common code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	NFSD: trace nfsctl operations	Chuck Lever
	Add trace log eye-catchers that record the arguments used to configure NFSD. This helps when troubleshooting the NFSD administrative interfaces. These tracepoints can capture NFSD start-up and shutdown times and parameters, changes in lease time and thread count, and a request to end the namespace's NFSv4 grace period, in addition to the set of NFS versions that are enabled. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	NFSD: Clean up nfsctl_transaction_write()	Chuck Lever
	For easier readability, follow the common convention: if (error) handle_error; continue_normally; No behavior change is expected. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	NFSD: Clean up nfsctl white-space damage	Chuck Lever
	Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	nfsd: use vfs setgid helper	Christian Brauner
	We've aligned setgid behavior over multiple kernel releases. The details can be found in commit cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") and commit 426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux"). Consistent setgid stripping behavior is now encapsulated in the setattr_should_drop_sgid() helper which is used by all filesystems that strip setgid bits outside of vfs proper. Usually ATTR_KILL_SGID is raised in e.g., chown_common() and is subject to the setattr_should_drop_sgid() check to determine whether the setgid bit can be retained. Since nfsd is raising ATTR_KILL_SGID unconditionally it will cause notify_change() to strip it even if the caller had the necessary privileges to retain it. Ensure that nfsd only raises ATR_KILL_SGID if the caller lacks the necessary privileges to retain the setgid bit. Without this patch the setgid stripping tests in LTP will fail: > As you can see, the problem is S_ISGID (0002000) was dropped on a > non-group-executable file while chown was invoked by super-user, while [...] > fchown02.c:66: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700 [...] > chown02.c:57: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700 With this patch all tests pass. Reported-by: Sherry Yang <sherry.yang@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05	highmem: Rename put_and_unmap_page() to unmap_and_put_page()	Fabio M. De Francesco
	With commit 849ad04cf562a ("new helper: put_and_unmap_page()"), Al Viro introduced the put_and_unmap_page() to use in those many places where we have a common pattern consisting of calls to kunmap_local() + put_page(). Obviously, first we unmap and then we put pages. Instead, the original name of this helper seems to imply that we first put and then unmap. Therefore, rename the helper and change the only known upstreamed user (i.e., fs/sysv) before this helper enters common use and might become difficult to find all call sites and instead easy to break the builds. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Message-Id: <20230602103307.5637-1-fmdefrancesco@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-05	cachefiles: Allow the cache to be non-root	David Howells
	Set mode 0600 on files in the cache so that cachefilesd can run as an unprivileged user rather than leaving the files all with 0. Directories are already set to 0700. Userspace then needs to set the uid and gid before issuing the "bind" command and the cache must've been chown'd to those IDs. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> cc: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com cc: linux-erofs@lists.ozlabs.org cc: linux-fsdevel@vger.kernel.org Message-Id: <1853230.1684516880@warthog.procyon.org.uk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-05	Merge 6.4-rc5 into driver-core-next	Greg Kroah-Hartman
	We need the driver core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-05	xfs: collect errors from inodegc for unlinked inode recovery	Dave Chinner
	Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list. This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down. Fix this by collecting inodegc worker errors and feed them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error. In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain. In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: validate block number being freed before adding to xefi	Dave Chinner
	Bad things happen in defered extent freeing operations if it is passed a bad block number in the xefi. This can come from a bogus agno/agbno pair from deferred agfl freeing, or just a bad fsbno being passed to __xfs_free_extent_later(). Either way, it's very difficult to diagnose where a null perag oops in EFI creation is coming from when the operation that queued the xefi has already been completed and there's no longer any trace of it around.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: validity check agbnos on the AGFL	Dave Chinner
	If the agfl or the indexing in the AGF has been corrupted, getting a block form the AGFL could return an invalid block number. If this happens, bad things happen. Check the agbno we pull off the AGFL and return -EFSCORRUPTED if we find somethign bad. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix agf/agfl verification on v4 filesystems	Dave Chinner
	When a v4 filesystem has fl_last - fl_first != fl_count, we do not not detect the corruption and allow the AGF to be used as it if was fully valid. On V5 filesystems, we reset the AGFL to empty in these cases and avoid the corruption at a small cost of leaked blocks. If we don't catch the corruption on V4 filesystems, bad things happen later when an allocation attempts to trim the free list and either double-frees stale entries in the AGFl or tries to free NULLAGBNO entries. Either way, this is bad. Prevent this from happening by using the AGFL_NEED_RESET logic for v4 filesysetms, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()	Dave Chinner
	xfs_bmap_longest_free_extent() can return an error when accessing the AGF fails. In this case, the behaviour of xfs_filestream_pick_ag() is conditional on the error. We may continue the loop, or break out of it. The error handling after the loop cleans up the perag reference held when the break occurs. If we continue, the next loop iteration handles cleaning up the perag reference. EIther way, we don't need to release the active perag reference when xfs_bmap_longest_free_extent() fails. Doing so means we do a double decrement on the active reference count, and this causes tha active reference count to fall to zero. At this point, new active references will fail. This leads to unmount hanging because it tries to grab active references to that perag, only for it to fail. This happens inside a loop that retries until a inode tree radix tree tag is cleared, which cannot happen because we can't get an active reference to the perag. The unmount livelocks in this path: xfs_reclaim_inodes+0x80/0xc0 xfs_unmount_flush_inodes+0x5b/0x70 xfs_unmountfs+0x5b/0x1a0 xfs_fs_put_super+0x49/0x110 generic_shutdown_super+0x7c/0x1a0 kill_block_super+0x27/0x50 deactivate_locked_super+0x30/0x90 deactivate_super+0x3c/0x50 cleanup_mnt+0xc2/0x160 __cleanup_mnt+0x12/0x20 task_work_run+0x5e/0xa0 exit_to_user_mode_prepare+0x1bc/0x1c0 syscall_exit_to_user_mode+0x16/0x40 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported-by: Pengfei Xu <pengfei.xu@intel.com> Fixes: eb70aa2d8ed9 ("xfs: use for_each_perag_wrap in xfs_filestream_pick_ag") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix broken logic when detecting mergeable bmap records	Darrick J. Wong
	Commit 6bc6c99a944c was a well-intentioned effort to initiate consolidation of adjacent bmbt mapping records by setting the PREEN flag. Consolidation can only happen if the length of the combined record doesn't overflow the 21-bit blockcount field of the bmbt recordset. Unfortunately, the length test is inverted, leading to it triggering on data forks like these: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..16777207]: 76110848..92888055 0 (76110848..92888055) 16777208 1: [16777208..20639743]: 92888056..96750591 0 (92888056..96750591) 3862536 Note that record 0 has a length of 16777208 512b blocks. This corresponds to 2097151 4k fsblocks, which is the maximum. Hence the two records cannot be merged. However, the logic is still wrong even if we change the in-loop comparison, because the scope of our examination isn't broad enough inside the loop to detect mappings like this: 0: [0..9]: 76110838..76110847 0 (76110838..76110847) 10 1: [10..16777217]: 76110848..92888055 0 (76110848..92888055) 16777208 2: [16777218..20639753]: 92888056..96750591 0 (92888056..96750591) 3862536 These three records could be merged into two, but one cannot determine this purely from looking at records 0-1 or 1-2 in isolation. Hoist the mergability detection outside the loop, and base its decision making on whether or not a merged mapping could be expressed in fewer bmbt records. While we're at it, fix the incorrect return type of the iter function. Fixes: 336642f79283 ("xfs: alert the user about data/attr fork mappings that could be merged") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: Fix undefined behavior of shift into sign bit	Geert Uytterhoeven
	With gcc-5: In file included from ./include/trace/define_trace.h:102:0, from ./fs/xfs/scrub/trace.h:988, from fs/xfs/scrub/trace.c:40: ./fs/xfs/./scrub/trace.h: In function ‘trace_raw_output_xchk_fsgate_class’: ./fs/xfs/scrub/scrub.h:111:28: error: initializer element is not constant #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ ^ Shifting the (signed) value 1 into the sign bit is undefined behavior. Fix this for all definitions in the file by shifting "1U" instead of "1". This was exposed by the first user added in commit 466c525d6d35e691 ("xfs: minimize overhead of drain wakeups by using jump labels"). Fixes: 160b5a784525e8a4 ("xfs: hoist the already_fixed variable to the scrub context") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix AGF vs inode cluster buffer deadlock	Dave Chinner
	Lock order in XFS is AGI -> AGF, hence for operations involving inode unlinked list operations we always lock the AGI first. Inode unlinked list operations operate on the inode cluster buffer, so the lock order there is AGI -> inode cluster buffer. For O_TMPFILE operations, this now means the lock order set down in xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the unlinked ops are done before the directory modifications that may allocate space and lock the AGF. Unfortunately, we also now lock the inode cluster buffer when logging an inode so that we can attach the inode to the cluster buffer and pin it in memory. This creates a lock order of AGF -> inode cluster buffer in directory operations as we have to log the inode after we've allocated new space for it. This creates a lock inversion between the AGF and the inode cluster buffer. Because the inode cluster buffer is shared across multiple inodes, the inversion is not specific to individual inodes but can occur when inodes in the same cluster buffer are accessed in different orders. To fix this we need move all the inode log item cluster buffer interactions to the end of the current transaction. Unfortunately, xfs_trans_log_inode() calls are littered throughout the transactions with no thought to ordering against other items or locking. This makes it difficult to do anything that involves changing the call sites of xfs_trans_log_inode() to change locking orders. However, we do now have a mechanism that allows is to postpone dirty item processing to just before we commit the transaction: the ->iop_precommit method. This will be called after all the modifications are done and high level objects like AGI and AGF buffers have been locked and modified, thereby providing a mechanism that guarantees we don't lock the inode cluster buffer before those high level objects are locked. This change is largely moving the guts of xfs_trans_log_inode() to xfs_inode_item_precommit() and providing an extra flag context in the inode log item to track the dirty state of the inode in the current transaction. This also means we do a lot less repeated work in xfs_trans_log_inode() by only doing it once per transaction when all the work is done. Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: defered work could create precommits	Dave Chinner
	To fix a AGI-AGF-inode cluster buffer deadlock, we need to move inode cluster buffer operations to the ->iop_precommit() method. However, this means that deferred operations can require precommits to be run on the final transaction that the deferred ops pass back to xfs_trans_commit() context. This will be exposed by attribute handling, in that the last changes to the inode in the attr set state machine "disappear" because the precommit operation is not run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: restore allocation trylock iteration	Dave Chinner
	It was accidentally dropped when refactoring the allocation code, resulting in the AG iteration always doing blocking AG iteration. This results in a small performance regression for a specific fsmark test that runs more user data writer threads than there are AGs. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 2edf06a50f5b ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: buffer pins need to hold a buffer reference	Dave Chinner
	When a buffer is unpinned by xfs_buf_item_unpin(), we need to access the buffer after we've dropped the buffer log item reference count. This opens a window where we can have two racing unpins for the buffer item (e.g. shutdown checkpoint context callback processing racing with journal IO iclog completion processing) and both attempt to access the buffer after dropping the BLI reference count. If we are unlucky, the "BLI freed" context wins the race and frees the buffer before the "BLI still active" case checks the buffer pin count. This results in a use after free that can only be triggered in active filesystem shutdown situations. To fix this, we need to ensure that buffer existence extends beyond the BLI reference count checks and until the unpin processing is complete. This implies that a buffer pin operation must also take a buffer reference to ensure that the buffer cannot be freed until the buffer unpin processing is complete. Reported-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-04	fsverity: simplify error handling in verify_data_block()	Eric Biggers
	Clean up the error handling in verify_data_block() to (a) eliminate the 'err' variable which has caused some confusion because the function actually returns a bool, (b) reduce the compiled code size slightly, and (c) execute one fewer branch in the success case. Link: https://lore.kernel.org/r/20230604022312.48532-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04	fsverity: don't use bio_first_page_all() in fsverity_verify_bio()	Eric Biggers
	bio_first_page_all(bio)->mapping->host is not compatible with large folios, since the first page of the bio is not necessarily the head page of the folio, and therefore it might not have the mapping pointer set. Therefore, move the dereference of ->mapping->host into verify_data_blocks(), which works with a folio. (Like the commit that this Fixes, this hasn't actually been tested with large folios yet, since the filesystems that use fs/verity/ don't support that yet. But based on code review, I think this is needed.) Fixes: 5d0f0e57ed90 ("fsverity: support verifying data from large folios") Link: https://lore.kernel.org/r/20230604022101.48342-1-ebiggers@kernel.org Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04	fsverity: constify fsverity_hash_alg	Eric Biggers
	Now that fsverity_hash_alg doesn't have an embedded mempool, it can be 'const' almost everywhere. Add it. Link: https://lore.kernel.org/r/20230604022348.48658-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04	fsverity: use shash API instead of ahash API	Eric Biggers
	The "ahash" API, like the other scatterlist-based crypto APIs such as "skcipher", comes with some well-known limitations. First, it can't easily be used with vmalloc addresses. Second, the request struct can't be allocated on the stack. This adds complexity and a possible failure point that needs to be worked around, e.g. using a mempool. The only benefit of ahash over "shash" is that ahash is needed to access traditional memory-to-memory crypto accelerators, i.e. drivers/crypto/. However, this style of crypto acceleration has largely fallen out of favor and been superseded by CPU-based acceleration or inline crypto engines. Also, ahash needs to be used asynchronously to take full advantage of such hardware, but fs/verity/ has never done this. On all systems that aren't actually using one of these ahash-only crypto accelerators, ahash just adds unnecessary overhead as it sits between the user and the underlying shash algorithms. Also, XFS is planned to cache fsverity Merkle tree blocks in the existing XFS buffer cache. As a result, it will be possible for a single Merkle tree block to be split across discontiguous pages (https://lore.kernel.org/r/20230405233753.GU3223426@dread.disaster.area). This data will need to be hashed. It is easiest to work with a vmapped address in this case. However, ahash is incompatible with this. Therefore, let's convert fs/verity/ from ahash to shash. This simplifies the code, and it should also slightly improve performance for everyone who wasn't actually using one of these ahash-only crypto accelerators, i.e. almost everyone (or maybe even everyone)! Link: https://lore.kernel.org/r/20230516052306.99600-1-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com>