summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2023-08-18btrfs: fix incorrect splitting in btrfs_drop_extent_map_rangeJosef Bacik
In production we were seeing a variety of WARN_ON()'s in the extent_map code, specifically in btrfs_drop_extent_map_range() when we have to call add_extent_mapping() for our second split. Consider the following extent map layout PINNED [0 16K) [32K, 48K) and then we call btrfs_drop_extent_map_range for [0, 36K), with skip_pinned == true. The initial loop will have start = 0 end = 36K len = 36K we will find the [0, 16k) extent, but since we are pinned we will skip it, which has this code start = em_end; if (end != (u64)-1) len = start + len - em_end; em_end here is 16K, so now the values are start = 16K len = 16K + 36K - 16K = 36K len should instead be 20K. This is a problem when we find the next extent at [32K, 48K), we need to split this extent to leave [36K, 48k), however the code for the split looks like this split->start = start + len; split->len = em_end - (start + len); In this case we have em_end = 48K split->start = 16K + 36K // this should be 16K + 20K split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K and now we have an invalid extent_map in the tree that potentially overlaps other entries in the extent map. Even in the non-overlapping case we will have split->start set improperly, which will cause problems with any block related calculations. We don't actually need len in this loop, we can simply use end as our end point, and only adjust start up when we find a pinned extent we need to skip. Adjust the logic to do this, which keeps us from inserting an invalid extent map. We only skip_pinned in the relocation case, so this is relatively rare, except in the case where you are running relocation a lot, which can happen with auto relocation on. Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-18ext2: improve consistency of ext2_fsblk_t datatype usageGeorg Ottinger
The ext2 block allocation/deallocation functions and their respective calls use a mixture of unsigned long and ext2_fsblk_t datatypes to index the desired ext2 block. This commit replaces occurrences of unsigned long with ext2_fsblk_t, covering the functions ext2_new_block(), ext2_new_blocks(), ext2_free_blocks(), ext2_free_data() and ext2_free_branches(). This commit is rather conservative, and only replaces unsigned long with ext2_fsblk_t if the variable is used to index a specific ext2 block. Signed-off-by: Georg Ottinger <g.ottinger@gmx.at> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230817195925.10268-1-g.ottinger@gmx.at>
2023-08-18fs/xfs: Fix typos in commentsZizhen Pang
Delete duplicate word "the" [chandan: Fix mangled patch] Signed-off-by: Zizhen Pang <pangzizhen001@208suo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2023-08-18xfs: fix dqiterate thinkoDarrick J. Wong
For some unknown reason, when I converted the incore dquot objects to store the dquot id in host endian order, I removed the increment here. This causes the scan to stop after retrieving the root dquot, which severely limits the usefulness of the quota scrubber. Fix the lost increment, though it won't fix the problem that the quota iterator code filters out zeroed dquot records. Fixes: c51df7334167e ("xfs: stop using q_core.d_id in the quota code") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2023-08-18Merge tag 'scrub-bmap-fixes-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: fixes for the block mapping checker This series amends the file extent map checking code so that nonexistent cow/attr forks get the ENOENT return they're supposed to; and fixes some incorrect logic about the presence of a cow fork vs. reflink iflag. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'scrub-bmap-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: don't check reflink iflag state when checking cow fork xfs: simplify returns in xchk_bmap xfs: rewrite xchk_inode_is_allocated to work properly xfs: hide xfs_inode_is_allocated in scrub common code
2023-08-18Merge tag 'repair-agfl-fixes-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: fixes to the AGFL repair code This series contains a couple of bug fixes to the AGFL repair code that came up during QA. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-agfl-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix agf_fllast when repairing an empty AGFL xfs: clear pagf_agflreset when repairing the AGFL
2023-08-18Merge tag 'repair-force-rebuild-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: force rebuilding of metadata This patchset adds a new IFLAG to the scrub ioctl so that userspace can force a rebuild of an otherwise consistent piece of metadata. This will eventually enable the use of online repair to relocate metadata during a filesystem reorganization (e.g. shrink). For now, it facilitates stress testing of online repair without needing the debugging knobs to be enabled. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: allow userspace to rebuild metadata structures xfs: don't complain about unfixed metadata when repairs were injected
2023-08-18Merge tag 'repair-tweaks-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: miscellaneous repair tweaks Before we start adding online repair functionality, there's a few tweaks that I'd like to make to the common repair code. First is a fix to the integration between repair and the health status code that was interfering with repair re-evaluations. Second is a minor tweak to the sole existing repair functions to make one last check that the user hasn't terminated the calling process before we start writing to the filesystem. This is a pattern that will repeat throughout the rest of the repair functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-tweaks-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: allow the user to cancel repairs before we start writing xfs: always rescan allegedly healthy per-ag metadata after repair
2023-08-18Merge tag 'scrub-rtsummary-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: online scrubbing of realtime summary files This patchset implements an online checker for the realtime summary file. The first few changes are some general cleanups -- scrub should get its own references to all inodes, and we also wrap the inode lock functions so that we can standardize unlocking and releasing inodes that are the focus of a scrub. With that out of the way, we move on to constructing a shadow copy of the rtsummary information from the rtbitmap, and compare the new copy against the ondisk copy. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'scrub-rtsummary-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: implement online scrubbing of rtsummary info xfs: move the realtime summary file scrubber to a separate source file xfs: wrap ilock/iunlock operations on sc->ip xfs: get our own reference to inodes that we want to scrub
2023-08-18Merge tag 'scrub-usage-stats-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: add usage counters for scrub This series introduces simple usage and performance counters for the online fsck subsystem. The goal here is to enable developers and sysadmins to look at summary counts of how many objects were checked and repaired; what the outcomes were; and how much time the kernel has spent on these operations. The counter file is exposed in debugfs because that's easier than cramming it into the device model, and debugfs doesn't have rules against complex file contents, unlike sysfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'scrub-usage-stats-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: track usage statistics of online fsck xfs: create scaffolding for creating debugfs entries
2023-08-18Merge tag 'big-array-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: stage repair information in pageable memory In general, online repair of an indexed record set walks the filesystem looking for records. These records are sorted and bulk-loaded into a new btree. To make this happen without pinning gigabytes of metadata in memory, first create an abstraction ('xfile') of memfd files so that kernel code can access paged memory, and then an array abstraction ('xfarray') based on xfiles so that online repair can create an array of new records without pinning memory. These two data storage abstractions are critical for repair of space metadata -- the memory used is pageable, which helps us avoid pinning kernel memory and driving OOM problems; and they are byte-accessible enough that we can use them like (very slow and programmatic) memory buffers. Later patchsets will build on this functionality to provide blob storage and btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'big-array-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: improve xfarray quicksort pivot xfs: cache pages used for xfarray quicksort convergence xfs: speed up xfarray sort by sorting xfile page contents directly xfs: teach xfile to pass back direct-map pages to caller xfs: convert xfarray insertion sort to heapsort using scratchpad memory xfs: enable sorting of xfile-backed arrays xfs: create a big array data structure
2023-08-18Merge tag 'repair-reap-fixes-6.6_2023-08-10' of ↵Chandan Babu R
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: fix online repair block reaping These patches fix a few problems that I noticed in the code that deals with old btree blocks after a successful repair. First, I observed that it is possible for repair to incorrectly invalidate and delete old btree blocks if they were crosslinked. The solution here is to consult the reverse mappings for each block in the extent -- singly owned blocks are invalidated and freed, whereas for crosslinked blocks, we merely drop the incorrect reverse mapping. A largeish change in this patchset is moving the reaping code to a separate file, because the code are mostly interrelated static functions. For now this also drops the ability to reap file blocks, which will return when we add the bmbt repair functions. Second, we convert the reap function to use EFIs so that we can commit to freeing as many blocks in as few transactions as we dare. We would like to free as many old blocks as we can in the same transaction that commits the new structure to the ondisk filesystem to minimize the number of blocks that leak if the system crashes before the repair fully completes. The third change made in this series is to avoid tripping buffer cache assertions if we're merely scanning the buffer cache for buffers to invalidate, and find a non-stale buffer of the wrong length. This is primarily cosmetic, but makes my life easier. The fourth change restructures the reaping code to try to process as many blocks in one go as possible, to reduce logging traffic. The last change switches the reaping mechanism to use per-AG bitmaps defined in a previous patchset. This should reduce type confusion when reading the source code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-reap-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair xfs: reap large AG metadata extents when possible xfs: allow scanning ranges of the buffer cache for live buffers xfs: rearrange xrep_reap_block to make future code flow easier xfs: use deferred frees to reap old btree blocks xfs: only allow reaping of per-AG blocks in xrep_reap_extents xfs: only invalidate blocks if we're going to free them xfs: move the post-repair block reaping code to a separate file xfs: cull repair code that will never get used
2023-08-17pstore: Support record sizes larger than kmalloc() limitYuxiao Zhang
Currently pstore record buffers are allocated using kmalloc() which has a maximum size based on page size. If a large "pmsg-size" module parameter is specified, pmsg will fail to copy the contents since memdup_user() is limited to kmalloc() allocation sizes. Since we don't need physically contiguous memory for any of the pstore record buffers, use kvzalloc() to avoid such limitations in the core of pstore and in the ram backend, and explicitly read from userspace using vmemdup_user(). This also means that any other backends that want to (or do already) support larger record sizes will Just Work now. Signed-off-by: Yuxiao Zhang <yuxiaozhang@google.com> Link: https://lore.kernel.org/r/20230627202540.881909-2-yuxiaozhang@google.com Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-17NFS: Fix a use after free in nfs_direct_join_group()Trond Myklebust
Be more careful when tearing down the subrequests of an O_DIRECT write as part of a retransmission. Reported-by: Chris Mason <clm@fb.com> Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-08-17btrfs: fix BUG_ON condition in btrfs_cancel_balancexiaoshoukui
Pausing and canceling balance can race to interrupt balance lead to BUG_ON panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance does not take this race scenario into account. However, the race condition has no other side effects. We can fix that. Reproducing it with panic trace like this: kernel BUG at fs/btrfs/volumes.c:4618! RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0 Call Trace: <TASK> ? do_nanosleep+0x60/0x120 ? hrtimer_nanosleep+0xb7/0x1a0 ? sched_core_clone_cookie+0x70/0x70 btrfs_ioctl_balance_ctl+0x55/0x70 btrfs_ioctl+0xa46/0xd20 __x64_sys_ioctl+0x7d/0xa0 do_syscall_64+0x38/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Race scenario as follows: > mutex_unlock(&fs_info->balance_mutex); > -------------------- > .......issue pause and cancel req in another thread > -------------------- > ret = __btrfs_balance(fs_info); > > mutex_lock(&fs_info->balance_mutex); > if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) { > btrfs_info(fs_info, "balance: paused"); > btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED); > } CC: stable@vger.kernel.org # 4.19+ Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17btrfs: only subtract from len_to_oe_boundary when it is tracking an extentChris Mason
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone as we submit bios for writes. Every time we add a page to the bio, we decrement those bytes from len_to_oe_boundary, and then we submit the bio if we happen to hit zero. Most of the time, len_to_oe_boundary gets set to U32_MAX. submit_extent_page() adds pages into our bio, and the size of the bio ends up limited by: - Are we contiguous on disk? - Does bio_add_page() allow us to stuff more in? - is len_to_oe_boundary > 0? The len_to_oe_boundary math starts with U32_MAX, which isn't page or sector aligned, and subtracts from it until it hits zero. In the non-zoned case, the last IO we submit before we hit zero is going to be unaligned, triggering BUGs. This is hard to trigger because bio_add_page() isn't going to make a bio of U32_MAX size unless you give it a perfect set of pages and fully contiguous extents on disk. We can hit it pretty reliably while making large swapfiles during provisioning because the machine is freshly booted, mostly idle, and the disk is freshly formatted. It's also possible to trigger with reads when read_ahead_kb is set to 4GB. The code has been clean up and shifted around a few times, but this flaw has been lurking since the counter was added. I think the commit 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended up exposing the bug. The fix used here is to skip doing math on len_to_oe_boundary unless we've changed it from the default U32_MAX value. bio_add_page() is the real limit we want, and there's no reason to do extra math when block layer is doing it for us. Sample reproducer, note you'll need to change the path to the bdi and device: SUBVOL=/btrfs/swapvol SWAPFILE=$SUBVOL/swapfile SZMB=8192 mkfs.btrfs -f /dev/vdb mount /dev/vdb /btrfs btrfs subvol create $SUBVOL chattr +C $SUBVOL dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB sync echo 4 > /proc/sys/vm/drop_caches echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb while true; do echo 1 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/drop_caches dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock done Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17btrfs: fix replace/scrub failure with metadata_uuidAnand Jain
Fstests with POST_MKFS_CMD="btrfstune -m" (as in the mailing list) reported a few of the test cases failing. The failure scenario can be summarized and simplified as follows: $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb1 /dev/sdb2 :0 $ btrfstune -m /dev/sdb1 :0 $ wipefs -a /dev/sdb1 :0 $ mount -o degraded /dev/sdb2 /btrfs :0 $ btrfs replace start -B -f -r 1 /dev/sdb1 /btrfs :1 STDERR: ERROR: ioctl(DEV_REPLACE_START) failed on "/btrfs": Input/output error [11290.583502] BTRFS warning (device sdb2): tree block 22036480 mirror 2 has bad fsid, has 99835c32-49f0-4668-9e66-dc277a96b4a6 want da40350c-33ac-4872-92a8-4948ed8c04d0 [11290.586580] BTRFS error (device sdb2): unable to fix up (regular) error at logical 22020096 on dev /dev/sdb8 physical 1048576 As above, the replace is failing because we are verifying the header with fs_devices::fsid instead of fs_devices::metadata_uuid, despite the metadata_uuid actually being present. To fix this, use fs_devices::metadata_uuid. We copy fsid into fs_devices::metadata_uuid if there is no metadata_uuid, so its fine. Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block") CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-16ext2: dump current reservation window infoYe Bin
There's report BUG in 'ext2_try_to_allocate_with_rsv()', although there's now dump of all reservation windows information. But there's unknown which window is being processed.So this is not helpful for locating the issue. To better analyze the problem, dump the information about reservation window that is being processed. And just bail with error instead of BUG here. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-5-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2023-08-16ext2: fix race between setxattr and write backYe Bin
There's an issue when allocating xattrs as follows: Block Allocation Reservation Windows Map (ext2_try_to_allocate_with_rsv): reservation window 0x000000006f105382 start: 0, end: 0 reservation window 0x000000008fd1a555 start: 1044, end: 1059 Window map complete. kernel BUG at fs/ext2/balloc.c:1158! invalid opcode: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:ext2_try_to_allocate_with_rsv.isra.0+0x15c4/0x1800 Call Trace: <TASK> ext2_new_blocks+0x935/0x1690 ext2_new_block+0x73/0xa0 ext2_xattr_set2+0x74f/0x1730 ext2_xattr_set+0x12b6/0x2260 ext2_xattr_user_set+0x9c/0x110 __vfs_setxattr+0x139/0x1d0 __vfs_setxattr_noperm+0xfc/0x370 __vfs_setxattr_locked+0x205/0x2c0 vfs_setxattr+0x19d/0x3b0 do_setxattr+0xff/0x220 setxattr+0x123/0x150 path_setxattr+0x193/0x1e0 __x64_sys_setxattr+0xc8/0x170 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Above issue may happens as follows: setxattr write back ext2_xattr_set ext2_xattr_set2 ext2_new_block ext2_new_blocks ext2_try_to_allocate_with_rsv alloc_new_reservation --> group=0 [0, 1023] rsv [1016, 1023] do_writepages mpage_writepages write_cache_pages __mpage_writepage ext2_get_block ext2_get_blocks ext2_alloc_branch ext2_new_blocks ext2_try_to_allocate_with_rsv alloc_new_reservation -->group=1 [1024, 2047] rsv [1044, 1059] if ((my_rsv->rsv_start > group_last_block) || (my_rsv->rsv_end < group_first_block) rsv_window_dump BUG(); Now ext2 mkwrite doesn't allocate new blocks so for these cases we may be allocating blocks during writeback. However, there is no protection between ext2_xattr_set() and do_writepages() so these two functions can conflict on handling the reservation window. To solve about issue don't use the reservation window when allocating block for xattr. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-4-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2023-08-16ext2: introduce new flags argument for ext2_new_blocks()Ye Bin
This patch introduces a new flags argument for ext2_new_blocks() and also a new EXT2_ALLOC_NORESERVE flag. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-3-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2023-08-16ext2: remove ext2_new_block()Ye Bin
Now, only xattr allocate block use ext2_new_block(), so just opencode it in the xattr code. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-2-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2023-08-16ext2: fix datatype of block number in ext2_xattr_set2()Georg Ottinger
I run a small server that uses external hard drives for backups. The backup software I use uses ext2 filesystems with 4KiB block size and the server is running SELinux and therefore relies on xattr. I recently upgraded the hard drives from 4TB to 12TB models. I noticed that after transferring some TBs I got a filesystem error "Freeing blocks not in datazone - block = 18446744071529317386, count = 1" and the backup process stopped. Trying to fix the fs with e2fsck resulted in a completely corrupted fs. The error probably came from ext2_free_blocks(), and because of the large number 18e19 this problem immediately looked like some kind of integer overflow. Whereas the 4TB fs was about 1e9 blocks, the new 12TB is about 3e9 blocks. So, searching the ext2 code, I came across the line in fs/ext2/xattr.c:745 where ext2_new_block() is called and the resulting block number is stored in the variable block as an int datatype. If a block with a block number greater than INT32_MAX is returned, this variable overflows and the call to sb_getblk() at line fs/ext2/xattr.c:750 fails, then the call to ext2_free_blocks() produces the error. Signed-off-by: Georg Ottinger <g.ottinger@gmx.at> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230815100340.22121-1-g.ottinger@gmx.at>
2023-08-16smb: client: fix null authScott Mayhew
Commit abdb1742a312 removed code that clears ctx->username when sec=none, so attempting to mount with '-o sec=none' now fails with -EACCES. Fix it by adding that logic to the parsing of the 'sec' option, as well as checking if the mount is using null auth before setting the username when parsing the 'user' option. Fixes: abdb1742a312 ("cifs: get rid of mount options string parsing") Cc: stable@vger.kernel.org Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-15sysctl: Use ctl_table_size as stopping criteria for list macroJoel Granados
This is a preparation commit to make it easy to remove the sentinel elements (empty end markers) from the ctl_table arrays. It both allows the systematic removal of the sentinels and adds the ctl_table_size variable to the stopping criteria of the list_for_each_table_entry macro that traverses all ctl_table arrays. Once all the sentinels are removed by subsequent commits, ctl_table_size will become the only stopping criteria in the macro. We don't actually remove any elements in this commit, but it sets things up to for the removal process to take place. By adding header->ctl_table_size as an additional stopping criteria for the list_for_each_table_entry macro, it will execute until it finds an "empty" ->procname or until the size runs out. Therefore if a ctl_table array with a sentinel is passed its size will be too big (by one element) but it will stop on the sentinel. On the other hand, if the ctl_table array without a sentinel is passed its size will be just write and there will be no need for a sentinel. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Jani Nikula <jani.nikula@linux.intel.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add size arg to __register_sysctl_initJoel Granados
This commit adds table_size to __register_sysctl_init in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do *not* remove any sentinels in this commit, we set things up by calculating the ctl_table array size with ARRAY_SIZE. We add a table_size argument to __register_sysctl_init and modify the register_sysctl_init macro to calculate the array size with ARRAY_SIZE. The original callers do not need to be updated as they will go through the new macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add size to register_sysctlJoel Granados
This commit adds table_size to register_sysctl in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do *not* remove any sentinels in this commit, we set things up by either passing the table_size explicitly or using ARRAY_SIZE on the ctl_table arrays. We replace the register_syctl function with a macro that will add the ARRAY_SIZE to the new register_sysctl_sz function. In this way the callers that are already using an array of ctl_table structs do not change. For the callers that pass a ctl_table array pointer, we pass the table_size to register_sysctl_sz instead of the macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add a size arg to __register_sysctl_tableJoel Granados
We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add size argument to init_headerJoel Granados
In this commit, we add a table_size argument to the init_header function in order to initialize the ctl_table_size variable in ctl_table_header. Even though the size is not yet used, it is now initialized within the sysctl subsys. We need this commit for when we start adding the table_size arguments to the sysctl functions (e.g. register_sysctl, __register_sysctl_table and __register_sysctl_init). Note that in __register_sysctl_table we temporarily use a calculated size until we add the size argument to that function in subsequent commits. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Use ctl_table_header in list_for_each_table_entryJoel Granados
We replace the ctl_table with the ctl_table_header pointer in list_for_each_table_entry which is the macro responsible for traversing the ctl_table arrays. This is a preparation commit that will make it easier to add the ctl_table array size (that will be added to ctl_table_header in subsequent commits) to the already existing loop logic based on empty ctl_table elements (so called sentinels). Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Prefer ctl_table_header in proc_sysctlJoel Granados
This is a preparation commit that replaces ctl_table with ctl_table_header as the pointer that is passed around in proc_sysctl.c. This will become necessary in subsequent commits when the size of the ctl_table array can no longer be calculated by searching for an empty sentinel (last empty ctl_table element) but will be carried along inside the ctl_table_header struct. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15Merge tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: "Three smb client fixes, all for stable: - fix for oops in unmount race with lease break of deferred close - debugging improvement for reconnect - fix for fscache deadlock (folio_wait_bit_common hang)" * tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: display network namespace in debug information cifs: Release folio lock on fscache read hit. cifs: fix potential oops in cifs_oplock_break
2023-08-15jbd2: drop useless error tag in jbd2_journal_wipe()Zhang Yi
no_recovery is redundant, just drop it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: cleanup journal_init_common()Zhang Yi
Adjust the initialization sequence and error handle of journal_t, moving load superblock to the begin, and classify others initialization. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: add fast_commit space checkZhang Yi
If JBD2_FEATURE_INCOMPAT_FAST_COMMIT bit is set, it means the journal have fast commit records need to recover, so the fast commit size should not be too large, and the leftover normal journal size should never less than JBD2_MIN_JOURNAL_BLOCKS. If it happens, the journal->j_last is likely to be wrong and will probably lead to incorrect journal recovery. So add a check into the journal_check_superblock(), and drop the pointless check when initializing the fastcommit parameters. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: cleanup load_superblock()Zhang Yi
Rename load_superblock() to journal_load_superblock(), move getting and reading superblock from journal_init_common() and journal_get_superblock() to this function, and also rename journal_get_superblock() to journal_check_superblock(), make it a pure check helper to check superblock validity from disk. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: open code jbd2_verify_csum_type() helperZhang Yi
jbd2_verify_csum_type() helper check checksum type in the superblock for v2 or v3 checksum feature, it always return true if these features are not enabled, and it has only one user, so open code it is more clear. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: checking valid features early in journal_get_superblock()Zhang Yi
journal_get_superblock() is used to check validity of the jounal supberblock, so move the features checks from jbd2_journal_load() to journal_get_superblock(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: don't load superblock in jbd2_journal_check_used_features()Zhang Yi
Since load_superblock() has been moved to journal_init_common(), the in-memory superblock structure is initialized and contains valid data once the file system has a journal_t object, so it's safe to access it, let's drop the call to journal_get_superblock() from jbd2_journal_check_used_features() and also drop the setting/clearing of the veirfy bit of the superblock buffer. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: move load_superblock() into journal_init_common()Zhang Yi
Move the call to load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() early into journal_init_common(), the journal superblock gets read and the in-memory journal_t structure gets initialised after calling jbd2_journal_init_{dev,inode}, it's safe to do following initialization according to it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: move load_superblock() dependent functionsZhang Yi
Move load_superblock() declaration and the functions it calls before journal_init_common(). This is a preparation for moving a call to load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() to journal_init_common(). No functional changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15fs: Fix one kernel-doc commentYang Li
Fix one kernel-doc comment to silence the warning: fs/read_write.c:88: warning: Function parameter or member 'maxsize' not described in 'generic_file_llseek_size' Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Message-Id: <20230811014359.4960-1-yang.lee@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUsMarcelo Tosatti
For certain types of applications (for example PLC software or RAN processing), upon occurrence of an event, it is necessary to complete a certain task in a maximum amount of time (deadline). One way to express this requirement is with a pair of numbers, deadline time and execution time, where: * deadline time: length of time between event and deadline. * execution time: length of time it takes for processing of event to occur on a particular hardware platform (uninterrupted). The particular values depend on use-case. For the case where the realtime application executes in a virtualized guest, an IPI which must be serviced in the host will cause the following sequence of events: 1) VM-exit 2) execution of IPI (and function call) 3) VM-entry Which causes an excess of 50us latency as observed by cyclictest (this violates the latency requirement of vRAN application with 1ms TTI, for example). invalidate_bh_lrus calls an IPI on each CPU that has non empty per-CPU cache: on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1); The performance when using the per-CPU LRU cache is as follows: 42 ns per __find_get_block 68 ns per __find_get_block_slow Given that the main use cases for latency sensitive applications do not involve block I/O (data necessary for program operation is locked in RAM), disable per-CPU buffer_head caches for isolated CPUs. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Message-Id: <ZJtBrybavtb1x45V@tpad> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15vfs, security: Fix automount superblock LSM init problem, preventing NFS sb ↵David Howells
sharing When NFS superblocks are created by automounting, their LSM parameters aren't set in the fs_context struct prior to sget_fc() being called, leading to failure to match existing superblocks. This bug leads to messages like the following appearing in dmesg when fscache is enabled: NFS: Cache volume key already in use (nfs,4.2,2,108,106a8c0,1,,,,100000,100000,2ee,3a98,1d4c,3a98,1) Fix this by adding a new LSM hook to load fc->security for submount creation. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/165962680944.3334508.6610023900349142034.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/165962729225.3357250.14350728846471527137.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/165970659095.2812394.6868894171102318796.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/166133579016.3678898.6283195019480567275.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/217595.1662033775@warthog.procyon.org.uk/ # v5 Fixes: 9bc61ab18b1d ("vfs: Introduce fs_context, switch vfs_kern_mount() to it.") Fixes: 779df6a5480f ("NFS: Ensure security label is set for root inode") Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: "Christian Brauner (Microsoft)" <brauner@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230808-master-v9-1-e0ecde888221@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14fs: add FSCONFIG_CMD_CREATE_EXCLChristian Brauner
Summary ======= This introduces FSCONFIG_CMD_CREATE_EXCL which will allows userspace to implement something like mount -t ext4 --exclusive /dev/sda /B which fails if a superblock for the requested filesystem does already exist: Before this patch ----------------- $ sudo ./move-mount -f xfs -o source=/dev/sda4 /A Requesting filesystem type xfs Mount options requested: source=/dev/sda4 Attaching mount at /A Moving single attached mount Setting key(source) with val(/dev/sda4) $ sudo ./move-mount -f xfs -o source=/dev/sda4 /B Requesting filesystem type xfs Mount options requested: source=/dev/sda4 Attaching mount at /B Moving single attached mount Setting key(source) with val(/dev/sda4) After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL -------------------------------------------------------------------------- $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A Requesting filesystem type xfs Request exclusive superblock creation Mount options requested: source=/dev/sda4 Attaching mount at /A Moving single attached mount Setting key(source) with val(/dev/sda4) $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B Requesting filesystem type xfs Request exclusive superblock creation Mount options requested: source=/dev/sda4 Attaching mount at /B Moving single attached mount Setting key(source) with val(/dev/sda4) Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed Details ======= As mentioned on the list (cf. [1]-[3]) mount requests like mount -t ext4 /dev/sda /A are ambigous for userspace. Either a new superblock has been created and mounted or an existing superblock has been reused and a bind-mount has been created. This becomes clear in the following example where two processes create the same mount for the same block device: P1 P2 fd_fs = fsopen("ext4"); fd_fs = fsopen("ext4"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000"); // wins and creates superblock fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) // finds compatible superblock of P1 // spins until P1 sets SB_BORN and grabs a reference fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) fd_mnt1 = fsmount(fd_fs); fd_mnt2 = fsmount(fd_fs); move_mount(fd_mnt1, "/A") move_mount(fd_mnt2, "/B") Not just does P2 get a bind-mount but the mount options that P2 requestes are silently ignored. The VFS itself doesn't, can't and shouldn't enforce filesystem specific mount option compatibility. It only enforces incompatibility for read-only <-> read-write transitions: mount -t ext4 /dev/sda /A mount -t ext4 -o ro /dev/sda /B The read-only request will fail with EBUSY as the VFS can't just silently transition a superblock from read-write to read-only or vica versa without risking security issues. To userspace this silent superblock reuse can become a security issue in because there is currently no straightforward way for userspace to know that they did indeed manage to create a new superblock and didn't just reuse an existing one. This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that returns EBUSY if an existing superblock would be reused. Userspace that needs to be sure that it did create a new superblock with the requested mount options can request superblock creation using this command. If the command succeeds they can be sure that they did create a new superblock with the requested mount options. This requires the new mount api. With the old mount api it would be necessary to plumb this through every legacy filesystem's file_system_type->mount() method. If they want this feature they are most welcome to switch to the new mount api. Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on each high-level superblock creation helper: (1) get_tree_nodev() Always allocate new superblock. Hence, FSCONFIG_CMD_CREATE and FSCONFIG_CMD_CREATE_EXCL are equivalent. The binderfs or overlayfs filesystems are examples. (4) get_tree_keyed() Finds an existing superblock based on sb->s_fs_info. Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY. The mqueue or nfsd filesystems are examples. (2) get_tree_bdev() This effectively works like get_tree_keyed(). The ext4 or xfs filesystems are examples. (3) get_tree_single() Only one superblock of this filesystem type can ever exist. Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY. The securityfs or configfs filesystems are examples. Note that some single-instance filesystems never destroy the superblock once it has been created during the first mount. For example, if securityfs has been mounted at least onces then the created superblock will never be destroyed again as long as there is still an LSM making use it. Consequently, even if securityfs is unmounted and the superblock seemingly destroyed it really isn't which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting reusing an existing superblock. This is acceptable thugh since special purpose filesystems such as this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway and if they do it's probably to make sure that mount options aren't ignored. Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on filesystems that make use of the low-level sget_fc() helper directly. They're all effectively variants on get_tree_keyed(), get_tree_bdev(), or get_tree_nodev(): (5) mtd_get_sb() Similar logic to get_tree_keyed(). (6) afs_get_tree() Similar logic to get_tree_keyed(). (7) ceph_get_tree() Similar logic to get_tree_keyed(). Already explicitly allows forcing the allocation of a new superblock via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev(). (8) fuse_get_tree_submount() Similar logic to get_tree_nodev(). (9) fuse_get_tree() Forces reuse of existing FUSE superblock. Forces reuse of existing superblock if passed in file refers to an existing FUSE connection. If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd referring to an existing FUSE connections this would cause the superblock reusal to fail. If reusing is the intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified. (10) fuse_get_tree() -> get_tree_nodev() Same logic as in get_tree_nodev(). (11) fuse_get_tree() -> get_tree_bdev() Same logic as in get_tree_bdev(). (12) virtio_fs_get_tree() Same logic as get_tree_keyed(). (13) gfs2_meta_get_tree() Forces reuse of existing gfs2 superblock. Mounting gfs2meta enforces that a gf2s superblock must already exist. If not, it will error out. Consequently, mounting gfs2meta with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified. (14) kernfs_get_tree() Similar logic to get_tree_keyed(). (15) nfs_get_tree_common() Similar logic to get_tree_keyed(). Already explicitly allows forcing the allocation of a new superblock via NFS_MOUNT_UNSHARED. This effectively turns it into get_tree_nodev(). Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14fs: add vfs_cmd_reconfigure()Christian Brauner
Split the steps to reconfigure a superblock into a tiny helper instead of open-coding it in the switch. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-3-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14fs: add vfs_cmd_create()Christian Brauner
Split the steps to create a superblock into a tiny helper. This will make the next patch easier to follow. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-2-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14super: remove get_tree_single_reconf()Christian Brauner
The get_tree_single_reconf() helper isn't used anywhere. Remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-1-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14btrfs: fix infinite directory readsFilipe Manana
The readdir implementation currently processes always up to the last index it finds. This however can result in an infinite loop if the directory has a large number of entries such that they won't all fit in the given buffer passed to the readdir callback, that is, dir_emit() returns a non-zero value. Because in that case readdir() will be called again and if in the meanwhile new directory entries were added and we still can't put all the remaining entries in the buffer, we keep repeating this over and over. The following C program and test script reproduce the problem: $ cat /mnt/readdir_prog.c #include <sys/types.h> #include <dirent.h> #include <stdio.h> int main(int argc, char *argv[]) { DIR *dir = opendir("."); struct dirent *dd; while ((dd = readdir(dir))) { printf("%s\n", dd->d_name); rename(dd->d_name, "TEMPFILE"); rename("TEMPFILE", dd->d_name); } closedir(dir); } $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV &> /dev/null #mkfs.xfs -f $DEV &> /dev/null #mkfs.ext4 -F $DEV &> /dev/null mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= 2000; i++)); do echo -n > $MNT/testdir/file_$i done cd $MNT/testdir /mnt/readdir_prog cd /mnt umount $MNT This behaviour is surprising to applications and it's unlike ext4, xfs, tmpfs, vfat and other filesystems, which always finish. In this case where new entries were added due to renames, some file names may be reported more than once, but this varies according to each filesystem - for example ext4 never reported the same file more than once while xfs reports the first 13 file names twice. So change our readdir implementation to track the last index number when opendir() is called and then make readdir() never process beyond that index number. This gives the same behaviour as ext4. Reported-by: Rob Landley <rob@landley.net> Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681 CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-14smb3: display network namespace in debug informationSteve French
We recently had problems where a network namespace was deleted causing hard to debug reconnect problems. To help deal with configuration issues like this it is useful to dump the network namespace to better debug what happened. So add this to information displayed in /proc/fs/cifs/DebugData for the server (and channels if mounted with multichannel). For example: Local Users To Server: 1 SecMode: 0x1 Req On Wire: 0 Net namespace: 4026531840 This can be easily compared with what is displayed for the processes on the system. For example /proc/1/ns/net in this case showed the same thing (see below), and we can see that the namespace is still valid in this example. 'net:[4026531840]' Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-14cifs: Release folio lock on fscache read hit.Russell Harmon via samba-technical
Under the current code, when cifs_readpage_worker is called, the call contract is that the callee should unlock the page. This is documented in the read_folio section of Documentation/filesystems/vfs.rst as: > The filesystem should unlock the folio once the read has completed, > whether it was successful or not. Without this change, when fscache is in use and cache hit occurs during a read, the page lock is leaked, producing the following stack on subsequent reads (via mmap) to the page: $ cat /proc/3890/task/12864/stack [<0>] folio_wait_bit_common+0x124/0x350 [<0>] filemap_read_folio+0xad/0xf0 [<0>] filemap_fault+0x8b1/0xab0 [<0>] __do_fault+0x39/0x150 [<0>] do_fault+0x25c/0x3e0 [<0>] __handle_mm_fault+0x6ca/0xc70 [<0>] handle_mm_fault+0xe9/0x350 [<0>] do_user_addr_fault+0x225/0x6c0 [<0>] exc_page_fault+0x84/0x1b0 [<0>] asm_exc_page_fault+0x27/0x30 This requires a reboot to resolve; it is a deadlock. Note however that the call to cifs_readpage_from_fscache does mark the page clean, but does not free the folio lock. This happens in __cifs_readpage_from_fscache on success. Releasing the lock at that point however is not appropriate as cifs_readahead also calls cifs_readpage_from_fscache and *does* unconditionally release the lock after its return. This change therefore effectively makes cifs_readpage_worker work like cifs_readahead. Signed-off-by: Russell Harmon <russ@har.mn> Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Reviewed-by: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>