summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-08-30Merge tag 'execve-v6.11-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fix from Kees Cook: - binfmt_elf_fdpic: fix AUXV size with ELF_HWCAP2 (Max Filippov) * tag 'execve-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined
2024-08-30dcache: keep dentry_hashtable or d_hash_shift even when not usedStephen Brennan
The runtime constant feature removes all the users of these variables, allowing the compiler to optimize them away. It's quite difficult to extract their values from the kernel text, and the memory saved by removing them is tiny, and it was never the point of this optimization. Since the dentry_hashtable is a core data structure, it's valuable for debugging tools to be able to read it easily. For instance, scripts built on drgn, like the dentrycache script[1], rely on it to be able to perform diagnostics on the contents of the dcache. Annotate it as used, so the compiler doesn't discard it. Link: https://github.com/oracle-samples/drgn-tools/blob/3afc56146f54d09dfd1f6d3c1b7436eda7e638be/drgn_tools/dentry.py#L325-L355 [1] Fixes: e3c92e81711d ("runtime constants: add x86 architecture support") Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-29fs: use kmem_cache_create_rcu()Christian Brauner
Switch to the new kmem_cache_create_rcu() helper which allows us to use a custom free pointer offset avoiding the need to have an external free pointer which would grow struct file behind our backs. Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-3-5460bc1f09f6@kernel.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29block: rework bio splittingChristoph Hellwig
The current setup with bio_may_exceed_limit and __bio_split_to_limits is a bit of a mess. Change it so that __bio_split_to_limits does all the work and is just a variant of bio_split_to_limits that returns nr_segs. This is done by inlining it and instead have the various bio_split_* helpers directly submit the potentially split bios. To support btrfs, the rw version has a lower level helper split out that just returns the offset to split. This turns out to nicely clean up the btrfs flow as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29fuse: use correct name fuse_conn_list in docstringAurelien Aptel
fuse_mount_list doesn't exist, use fuse_conn_list. Signed-off-by: Aurelien Aptel <aaptel@nvidia.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: add simple request tracepointsJosef Bacik
I've been timing various fuse operations and it's quite annoying to do with kprobes. Add two tracepoints for sending and ending fuse requests to make it easier to debug and time various operations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: refactor out shared logic in fuse_writepages_fill() and ↵Joanne Koong
fuse_writepage_locked() This change refactors the shared logic in fuse_writepages_fill() and fuse_writepages_locked() into two separate helper functions, fuse_writepage_args_page_fill() and fuse_writepage_args_setup(). No functional changes added. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: move fuse file initialization to wpa allocation timeJoanne Koong
Before this change, wpa->ia.ff is initialized with an acquired reference on the fuse file right before it submits the writeback request. If there are auxiliary writebacks, then the initialization and reference acquisition needs to also be set before we submit the auxiliary writeback request. To make the logic simpler and to pave the way for a subsequent refactoring of fuse_writepages_fill() and fuse_writepage_locked(), this change initializes and acquires wpa->ia.ff when the wpa is allocated. No functional changes added. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: convert fuse_writepages_fill() to use a folio for its tmp pageJoanne Koong
To pave the way for refactoring out the shared logic in fuse_writepages_fill() and fuse_writepage_locked(), this change converts the temporary page in fuse_writepages_fill() to use the folio API. This is similar to the change in commit e0887e095a80 ("fuse: Convert fuse_writepage_locked to take a folio"), which converted the tmp page in fuse_writepage_locked() to use the folio API. inc_node_page_state() is intentionally preserved here instead of converting to node_stat_add_folio() since it is updating the stat of the underlying page and to better maintain API symmetry with dec_node_page_stat() in fuse_writepage_finish_stat(). No functional changes added. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: move initialization of fuse_file to fuse_writepages() instead of in ↵Joanne Koong
callback Prior to this change, data->ff is checked and if not initialized then initialized in the fuse_writepages_fill() callback, which gets called for every dirty page in the address space mapping. This logic is better placed in the main fuse_writepages() caller where data.ff is initialized before walking the dirty pages. No functional changes added. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: refactor finished writeback stats updates into helper functionJoanne Koong
Move the logic for updating the bdi and page stats for a finished writeback into a separate helper function, where it can be called from both fuse_writepage_finish() and fuse_writepage_add() (in the case where there is already an auxiliary write request for the page). No functional changes added. Suggested by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: drop unused fuse_mount arg in fuse_writepage_finish()Joanne Koong
Drop the unused "struct fuse_mount *fm" arg in fuse_writepage_finish(). No functional changes added. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: add fast path for fuse_range_is_writebackyangyun
In some cases, the fi->writepages may be empty. And there is no need to check fi->writepages with spin_lock, which may have an impact on performance due to lock contention. For example, in scenarios where multiple readers read the same file without any writers, or where the page cache is not enabled. Also remove the outdated comment since commit 6b2fb79963fb ("fuse: optimize writepages search") has optimize the situation by replacing list with rb-tree. Signed-off-by: yangyun <yangyun50@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: cleanup request queuing towards virtiofsMiklos Szeredi
Virtiofs has its own queuing mechanism, but still requests are first queued on fiq->pending to be immediately dequeued and queued onto the virtio queue. The queuing on fiq->pending is unnecessary and might even have some performance impact due to being a contention point. Forget requests are handled similarly. Move the queuing of requests and forgets into the fiq->ops->*. fuse_iqueue_ops are renamed to reflect the new semantics. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Fixed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Tested-by: Peter-Jan Gootzen <pgootzen@nvidia.com> Reviewed-by: Peter-Jan Gootzen <pgootzen@nvidia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: disable the combination of passthrough and writeback cacheBernd Schubert
Current design and handling of passthrough is without fuse caching and with that FUSE_WRITEBACK_CACHE is conflicting. Fixes: 7dc4e97a4f9a ("fuse: introduce FUSE_PASSTHROUGH capability") Cc: stable@kernel.org # v6.9 Signed-off-by: Bernd Schubert <bschubert@ddn.com> Acked-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29ovl: don't set the superblock's errseq_t manuallyHaifeng Xu
Since commit 5679897eb104 ("vfs: make sync_filesystem return errors from ->sync_fs"), the return value from sync_fs callback can be seen in sync_filesystem(). Thus the errseq_set opreation can be removed here. Depends-on: commit 5679897eb104 ("vfs: make sync_filesystem return errors from ->sync_fs") Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-08-28cifs: Fix FALLOC_FL_ZERO_RANGE to preflush buffered part of target regionDavid Howells
Under certain conditions, the range to be cleared by FALLOC_FL_ZERO_RANGE may only be buffered locally and not yet have been flushed to the server. For example: xfs_io -f -t -c "pwrite -S 0x41 0 4k" \ -c "pwrite -S 0x42 4k 4k" \ -c "fzero 0 4k" \ -c "pread -v 0 8k" /xfstest.test/foo will write two 4KiB blocks of data, which get buffered in the pagecache, and then fallocate() is used to clear the first 4KiB block on the server - but we don't flush the data first, which means the EOF position on the server is wrong, and so the FSCTL_SET_ZERO_DATA RPC fails (and xfs_io ignores the error), but then when we try to read it, we see the old data. Fix this by preflushing any part of the target region that above the server's idea of the EOF position to force the server to update its EOF position. Note, however, that we don't want to simply expand the file by moving the EOF before doing the FSCTL_SET_ZERO_DATA[*] because someone else might see the zeroed region or if the RPC fails we then have to try to clean it up or risk getting corruption. [*] And we have to move the EOF first otherwise FSCTL_SET_ZERO_DATA won't do what we want. This fixes the generic/008 xfstest. [!] Note: A better way to do this might be to split the operation into two parts: we only do FSCTL_SET_ZERO_DATA for the part of the range below the server's EOF and then, if that worked, invalidate the buffered pages for the part above the range. Fixes: 6b69040247e1 ("cifs/smb3: Fix data inconsistent when zero file range") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <stfrench@microsoft.com> cc: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> cc: Pavel Shilovsky <pshilov@microsoft.com> cc: Paulo Alcantara <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-29Merge tag 'nfsd-6.11-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a number of crashers - Update email address for an NFSD reviewer * tag 'nfsd-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: fs/nfsd: fix update of inode attrs in CB_GETATTR nfsd: fix potential UAF in nfsd4_cb_getattr_release nfsd: hold reference to delegation when updating it for cb_getattr MAINTAINERS: Update Olga Kornievskaia's email address nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open nfsd: ensure that nfsd4_fattr_args.context is zeroed out
2024-08-29Merge tag 'for-6.11-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix use-after-free when submitting bios for read, after an error and partially submitted bio the original one is freed while it can be still be accessed again - fix fstests case btrfs/301, with enabled quotas wait for delayed iputs when flushing delalloc - fix periodic block group reclaim, an unitialized value can be returned if there are no block groups to reclaim - fix build warning (-Wmaybe-uninitialized) * tag 'for-6.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix uninitialized return value from btrfs_reclaim_sweep() btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk() btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap() btrfs: run delayed iputs when flushing delalloc
2024-08-28fuse: update stats for pages in dropped aux writeback listJoanne Koong
In the case where the aux writeback list is dropped (e.g. the pages have been truncated or the connection is broken), the stats for its pages and backing device info need to be updated as well. Fixes: e2653bd53a98 ("fuse: fix leaked aux requests") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Cc: <stable@vger.kernel.org> # v5.1 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: clear PG_uptodate when using a stolen pageMiklos Szeredi
Originally when a stolen page was inserted into fuse's page cache by fuse_try_move_page(), it would be marked uptodate. Then fuse_readpages_end() would call SetPageUptodate() again on the already uptodate page. Commit 413e8f014c8b ("fuse: Convert fuse_readpages_end() to use folio_end_read()") changed that by replacing the SetPageUptodate() + unlock_page() combination with folio_end_read(), which does mostly the same, except it sets the uptodate flag with an xor operation, which in the above scenario resulted in the uptodate flag being cleared, which in turn resulted in EIO being returned on the read. Fix by clearing PG_uptodate instead of setting it in fuse_try_move_page(), conforming to the expectation of folio_end_read(). Reported-by: Jürg Billeter <j@bitron.ch> Debugged-by: Matthew Wilcox <willy@infradead.org> Fixes: 413e8f014c8b ("fuse: Convert fuse_readpages_end() to use folio_end_read()") Cc: <stable@vger.kernel.org> # v6.10 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: fix memory leak in fuse_create_openyangyun
The memory of struct fuse_file is allocated but not freed when get_create_ext return error. Fixes: 3e2b6fdbdc9a ("fuse: send security context of inode on file") Cc: stable@vger.kernel.org # v5.17 Signed-off-by: yangyun <yangyun50@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: check aborted connection before adding requests to pending list for ↵Joanne Koong
resending There is a race condition where inflight requests will not be aborted if they are in the middle of being re-sent when the connection is aborted. If fuse_resend has already moved all the requests in the fpq->processing lists to its private queue ("to_queue") and then the connection starts and finishes aborting, these requests will be added to the pending queue and remain on it indefinitely. Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: <stable@vger.kernel.org> # v6.9 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28fuse: use unsigned type for getxattr/listxattr size truncationJann Horn
The existing code uses min_t(ssize_t, outarg.size, XATTR_LIST_MAX) when parsing the FUSE daemon's response to a zero-length getxattr/listxattr request. On 32-bit kernels, where ssize_t and outarg.size are the same size, this is wrong: The min_t() will pass through any size values that are negative when interpreted as signed. fuse_listxattr() will then return this userspace-supplied negative value, which callers will treat as an error value. This kind of bug pattern can lead to fairly bad security bugs because of how error codes are used in the Linux kernel. If a caller were to convert the numeric error into an error pointer, like so: struct foo *func(...) { int len = fuse_getxattr(..., NULL, 0); if (len < 0) return ERR_PTR(len); ... } then it would end up returning this userspace-supplied negative value cast to a pointer - but the caller of this function wouldn't recognize it as an error pointer (IS_ERR_VALUE() only detects values in the narrow range in which legitimate errno values are), and so it would just be treated as a kernel pointer. I think there is at least one theoretical codepath where this could happen, but that path would involve virtio-fs with submounts plus some weird SELinux configuration, so I think it's probably not a concern in practice. Cc: stable@vger.kernel.org # v4.9 Fixes: 63401ccdb2ca ("fuse: limit xattr returned size") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-28xfs: refactor xfs_file_fallocateChristoph Hellwig
Refactor xfs_file_fallocate into separate helpers for each mode, two factors for i_size handling and a single switch statement over the supported modes. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-7-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_spaceChristoph Hellwig
Move the xfs_is_always_cow_inode check from the caller into xfs_alloc_file_space to prepare for refactoring of xfs_file_fallocate. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-6-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28xfs: call xfs_flush_unmap_range from xfs_free_file_spaceChristoph Hellwig
Call xfs_flush_unmap_range from xfs_free_file_space so that xfs_file_fallocate doesn't have to predict which mode will call it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28fs: sort out the fallocate mode vs flag messChristoph Hellwig
The fallocate system call takes a mode argument, but that argument contains a wild mix of exclusive modes and an optional flags. Replace FALLOC_FL_SUPPORTED_MASK with FALLOC_FL_MODE_MASK, which excludes the optional flag bit, so that we can use switch statement on the value to easily enumerate the cases while getting the check for duplicate modes for free. To make this (and in the future the file system implementations) more readable also add a symbolic name for the 0 mode used to allocate blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28cifs: Fix copy offload to flush destination regionDavid Howells
Fix cifs_file_copychunk_range() to flush the destination region before invalidating it to avoid potential loss of data should the copy fail, in whole or in part, in some way. Fixes: 7b2404a886f8 ("cifs: Fix flushing, invalidation and file size with copy_file_range()") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <stfrench@microsoft.com> cc: Paulo Alcantara <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-28netfs, cifs: Fix handling of short DIO readDavid Howells
Short DIO reads, particularly in relation to cifs, are not being handled correctly by cifs and netfslib. This can be tested by doing a DIO read of a file where the size of read is larger than the size of the file. When it crosses the EOF, it gets a short read and this gets retried, and in the case of cifs, the retry read fails, with the failure being translated to ENODATA. Fix this by the following means: (1) Add a flag, NETFS_SREQ_HIT_EOF, for the filesystem to set when it detects that the read did hit the EOF. (2) Make the netfslib read assessment stop processing subrequests when it encounters one with that flag set. (3) Return rreq->transferred, the accumulated contiguous amount read to that point, to userspace for a DIO read. (4) Make cifs set the flag and clear the error if the read RPC returned ENODATA. (5) Make cifs set the flag and clear the error if a short read occurred without error and the read-to file position is now at the remote inode size. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-28cifs: Fix lack of credit renegotiation on read retryDavid Howells
When netfslib asks cifs to issue a read operation, it prefaces this with a call to ->clamp_length() which cifs uses to negotiate credits, providing receive capacity on the server; however, in the event that a read op needs reissuing, netfslib doesn't call ->clamp_length() again as that could shorten the subrequest, leaving a gap. This causes the retried read to be done with zero credits which causes the server to reject it with STATUS_INVALID_PARAMETER. This is a problem for a DIO read that is requested that would go over the EOF. The short read will be retried, causing EINVAL to be returned to the user when it fails. Fix this by making cifs_req_issue_read() negotiate new credits if retrying (NETFS_SREQ_RETRYING now gets set in the read side as well as the write side in this instance). This isn't sufficient, however: the new credits might not be sufficient to complete the remainder of the read, so also add an additional field, rreq->actual_len, that holds the actual size of the op we want to perform without having to alter subreq->len. We then rely on repeated short reads being retried until we finish the read or reach the end of file and make a zero-length read. Also fix a couple of places where the subrequest start and length need to be altered by the amount so far transferred when being used. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-28file: reclaim 24 bytes from f_ownerChristian Brauner
We do embedd struct fown_struct into struct file letting it take up 32 bytes in total. We could tweak struct fown_struct to be more compact but really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct should allocate the struct on demand. This frees up 24 bytes in struct file. That will have some potentially user-visible changes for the ownership fcntl()s. Some of them can now fail due to allocation failures. Practically, that probably will almost never happen as the allocations are small and they only happen once per file. The fown_struct is used during kill_fasync() which is used by e.g., pipes to generate a SIGIO signal. Sending of such signals is conditional on userspace having set an owner for the file using one of the F_OWNER fcntl()s. Such users will be unaffected if struct fown_struct is allocated during the fcntl() call. There are a few subsystems that call __f_setown() expecting file->f_owner to be allocated: (1) tun devices file->f_op->fasync::tun_chr_fasync() -> __f_setown() There are no callers of tun_chr_fasync(). (2) tty devices file->f_op->fasync::tty_fasync() -> __tty_fasync() -> __f_setown() tty_fasync() has no additional callers but __tty_fasync() has. Note that __tty_fasync() only calls __f_setown() if the @on argument is true. It's called from: file->f_op->release::tty_release() -> tty_release() -> __tty_fasync() -> __f_setown() tty_release() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of tty_release() are safe as well. file->f_op->release::tty_open() -> tty_release() -> __tty_fasync() -> __f_setown() __tty_hangup() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of __tty_hangup() are safe as well. From the callchains it's obvious that (1) and (2) end up getting called via file->f_op->fasync(). That can happen either through the F_SETFL fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If FASYNC is requested and the file isn't already FASYNC then file->f_op->fasync() is called with @on true which ends up causing both (1) and (2) to call __f_setown(). (1) and (2) are the only subsystems that call __f_setown() from the file->f_op->fasync() handler. So both (1) and (2) have been updated to allocate a struct fown_struct prior to calling fasync_helper() to register with the fasync infrastructure. That's safe as they both call fasync_helper() which also does allocations if @on is true. The other interesting case are file leases: (3) file leases lease_manager_ops->lm_setup::lease_setup() -> __f_setown() Which in turn is called from: generic_add_lease() -> lease_manager_ops->lm_setup::lease_setup() -> __f_setown() So here again we can simply make generic_add_lease() allocate struct fown_struct prior to the lease_manager_ops->lm_setup::lease_setup() which happens under a spinlock. With that the two remaining subsystems that call __f_setown() are: (4) dnotify (5) sockets Both have their own custom ioctls to set struct fown_struct and both have been converted to allocate a struct fown_struct on demand from their respective ioctls. Interactions with O_PATH are fine as well e.g., when opening a /dev/tty as O_PATH then no file->f_op->open() happens thus no file->f_owner is allocated. That's fine as no file operation will be set for those and the device has never been opened. fcntl()s called on such things will just allocate a ->f_owner on demand. Although I have zero idea why'd you care about f_owner on an O_PATH fd. Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28Merge tag 'v6.11-rc5-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - two RDMA/smbdirect fixes and a minor cleanup - punch hole fix * tag 'v6.11-rc5-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix FALLOC_FL_PUNCH_HOLE support smb/client: fix rdma usage in smb2_async_writev() smb/client: remove unused rq_iter_size from struct smb_rqst smb/client: avoid dereferencing rdata=NULL in smb2_new_read_req()
2024-08-27jfs: check if leafidx greater than num leaves per dmap treeEdward Adam Davis
syzbot report a out of bounds in dbSplit, it because dmt_leafidx greater than num leaves per dmap tree, add a checking for dmt_leafidx in dbFindLeaf. Shaggy: Modified sanity check to apply to control pages as well as leaf pages. Reported-and-tested-by: syzbot+dca05492eff41f604890@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=dca05492eff41f604890 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2024-08-27jfs: Fix uaf in dbFreeBitsEdward Adam Davis
[syzbot reported] ================================================================== BUG: KASAN: slab-use-after-free in __mutex_lock_common kernel/locking/mutex.c:587 [inline] BUG: KASAN: slab-use-after-free in __mutex_lock+0xfe/0xd70 kernel/locking/mutex.c:752 Read of size 8 at addr ffff8880229254b0 by task syz-executor357/5216 CPU: 0 UID: 0 PID: 5216 Comm: syz-executor357 Not tainted 6.11.0-rc3-syzkaller-00156-gd7a5aa4b3c00 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 __mutex_lock_common kernel/locking/mutex.c:587 [inline] __mutex_lock+0xfe/0xd70 kernel/locking/mutex.c:752 dbFreeBits+0x7ea/0xd90 fs/jfs/jfs_dmap.c:2390 dbFreeDmap fs/jfs/jfs_dmap.c:2089 [inline] dbFree+0x35b/0x680 fs/jfs/jfs_dmap.c:409 dbDiscardAG+0x8a9/0xa20 fs/jfs/jfs_dmap.c:1650 jfs_ioc_trim+0x433/0x670 fs/jfs/jfs_discard.c:100 jfs_ioctl+0x2d0/0x3e0 fs/jfs/ioctl.c:131 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 Freed by task 5218: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2252 [inline] slab_free mm/slub.c:4473 [inline] kfree+0x149/0x360 mm/slub.c:4594 dbUnmount+0x11d/0x190 fs/jfs/jfs_dmap.c:278 jfs_mount_rw+0x4ac/0x6a0 fs/jfs/jfs_mount.c:247 jfs_remount+0x3d1/0x6b0 fs/jfs/super.c:454 reconfigure_super+0x445/0x880 fs/super.c:1083 vfs_cmd_reconfigure fs/fsopen.c:263 [inline] vfs_fsconfig_locked fs/fsopen.c:292 [inline] __do_sys_fsconfig fs/fsopen.c:473 [inline] __se_sys_fsconfig+0xb6e/0xf80 fs/fsopen.c:345 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [Analysis] There are two paths (dbUnmount and jfs_ioc_trim) that generate race condition when accessing bmap, which leads to the occurrence of uaf. Use the lock s_umount to synchronize them, in order to avoid uaf caused by race condition. Reported-and-tested-by: syzbot+3c010e21296f33a5dc16@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2024-08-27btrfs: fix uninitialized return value from btrfs_reclaim_sweep()Filipe Manana
The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if none of the space infos is reclaimable (for example if periodic reclaim is disabled, which is the default), so we return an undefined value. This can be fixed my making btrfs_reclaim_sweep() not return any value as well as do_reclaim_sweep() because: 1) do_reclaim_sweep() always returns 0, so we can make it return void; 2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't care about its return value, and in its context there's nothing to do about any errors anyway. Therefore remove the return value from btrfs_reclaim_sweep() and do_reclaim_sweep(). Fixes: e4ca3932ae90 ("btrfs: periodic block_group reclaim") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-27xfs: reset rootdir extent size hint after growfsrtDarrick J. Wong
If growfsrt is run on a filesystem that doesn't have a rt volume, it's possible to change the rt extent size. If the root directory was previously set up with an inherited extent size hint and rtinherit, it's possible that the hint is no longer a multiple of the rt extent size. Although the verifiers don't complain about this, xfs_repair will, so if we detect this situation, log the root directory to clean it up. This is still racy, but it's better than nothing. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27xfs: take m_growlock when running growfsrtDarrick J. Wong
Take the grow lock when we're expanding the realtime volume, like we do for the other growfs calls. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27xfs: Fix missing interval for missing_owner in xfs fsmapZizhi Wo
In the fsmap query of xfs, there is an interval missing problem: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... BUG: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt [root@fedora ~]# Normally, we should be able to get [104, 107), but we got nothing. The problem is caused by shifting. The query for the problem-triggered scenario is for the missing_owner interval (e.g. freespace in rmapbt/ unknown space in bnobt), which is obtained by subtraction (gap). For this scenario, the interval is obtained by info->last. However, rec_daddr is calculated based on the start_block recorded in key[1], which is converted by calling XFS_BB_TO_FSBT. Then if rec_daddr does not exceed info->next_daddr, which means keys[1].fmr_physical >> (mp)->m_blkbb_log <= info->next_daddr, no records will be displayed. In the above example, 104 >> (mp)->m_blkbb_log = 12 and 107 >> (mp)->m_blkbb_log = 12, so the two are reduced to 0 and the gap is ignored: before calculate ----------------> after shifting 104(st) 107(ed) 12(st/ed) |---------| | sector size block size Resolve this issue by introducing the "end_daddr" field in xfs_getfsmap_info. This records |key[1].fmr_physical + key[1].length| at the granularity of sector. If the current query is the last, the rec_daddr is end_daddr to prevent missing interval problems caused by shifting. We only need to focus on the last query, because xfs disks are internally aligned with disk blocksize that are powers of two and minimum 512, so there is no problem with shifting in previous queries. After applying this patch, the above problem have been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [104..106]: free space 0 (104..106) 3 Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: limit the range of end_addr correctly] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap codeDarrick J. Wong
Use XFS_BUF_DADDR_NULL (instead of a magic sentinel value) to mean "this field is null" like the rest of xfs. Cc: wozizhi@huawei.com Fixes: e89c041338ed6 ("xfs: implement the GETFSMAP ioctl") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-27Merge v6.11-rc5 into drm-nextDaniel Vetter
amdgpu pr conconflicts due to patches cherry-picked to -fixes, I might as well catch up with a backmerge and handle them all. Plus both misc and intel maintainers asked for a backmerge anyway. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2024-08-27ceph: Convert to use jiffies macroChen Yufan
Use time_after_eq macro instead of using jiffies directly to handle wraparound. [ xiubli: adjust the header files order ] Signed-off-by: Chen Yufan <chenyufan@vivo.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-27ceph: Remove unused declarationsYue Haibing
These functions is never implemented and used. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-27Merge tag 'vfs-6.11-rc6.fixes' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Ensure that backing files uses file->f_ops->splice_write() for splice netfs: - Revert the removal of PG_private_2 from netfs_release_folio() as cephfs still relies on this - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to always be invalidated during truncation - Fix losing untruncated data in a folio by making letting netfs_release_folio() return false if the folio is dirty - Fix trimming of streaming-write folios in netfs_inval_folio() - Reset iterator before retrying a short read - Fix interaction of streaming writes with zero-point tracker afs: - During truncation afs currently calls truncate_setsize() which sets i_size, expands the pagecache and truncates it. The first two operations aren't needed because they will have already been done. So call truncate_pagecache() instead and skip the redundant parts overlayfs: - Fix checking of the number of allowed lower layers so 500 layers can actually be used instead of just 499 - Add missing '\n' to pr_err() output - Pass string to ovl_parse_layer() and thus allow it to be used for Opt_lowerdir as well pidfd: - Revert blocking the creation of pidfds for kthread as apparently userspace relies on this. Specifically, it breaks systemd during shutdown romfs: - Fix romfs_read_folio() to use the correct offset with folio_zero_tail()" * tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix interaction of streaming writes with zero-point tracker netfs: Fix missing iterator reset on retry of short read netfs: Fix trimming of streaming-write folios in netfs_inval_folio() netfs: Fix netfs_release_folio() to say no if folio dirty afs: Fix post-setattr file edit to do truncation correctly mm: Fix missing folio invalidation calls during truncation ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() backing-file: convert to using fops->splice_write Revert "pidfd: prevent creation of pidfds for kthreads" romfs: fix romfs_read_folio() netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
2024-08-26ext4: dax: fix overflowing extents beyond inode size when partially writingZhihao Cheng
The dax_iomap_rw() does two things in each iteration: map written blocks and copy user data to blocks. If the process is killed by user(See signal handling in dax_iomap_iter()), the copied data will be returned and added on inode size, which means that the length of written extents may exceed the inode size, then fsck will fail. An example is given as: dd if=/dev/urandom of=file bs=4M count=1 dax_iomap_rw iomap_iter // round 1 ext4_iomap_begin ext4_iomap_alloc // allocate 0~2M extents(written flag) dax_iomap_iter // copy 2M data iomap_iter // round 2 iomap_iter_advance iter->pos += iter->processed // iter->pos = 2M ext4_iomap_begin ext4_iomap_alloc // allocate 2~4M extents(written flag) dax_iomap_iter fatal_signal_pending done = iter->pos - iocb->ki_pos // done = 2M ext4_handle_inode_extension ext4_update_inode_size // inode size = 2M fsck reports: Inode 13, i_size is 2097152, should be 4194304. Fix? Fix the problem by truncating extents if the written length is smaller than expected. Fixes: 776722e85d3b ("ext4: DAX iomap write support") CC: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=219136 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://patch.msgid.link/20240809121532.2105494-1-chengzhihao@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-26ext4: don't set SB_RDONLY after filesystem errorsJan Kara
When the filesystem is mounted with errors=remount-ro, we were setting SB_RDONLY flag to stop all filesystem modifications. We knew this misses proper locking (sb->s_umount) and does not go through proper filesystem remount procedure but it has been the way this worked since early ext2 days and it was good enough for catastrophic situation damage mitigation. Recently, syzbot has found a way (see link) to trigger warnings in filesystem freezing because the code got confused by SB_RDONLY changing under its hands. Since these days we set EXT4_FLAGS_SHUTDOWN on the superblock which is enough to stop all filesystem modifications, modifying SB_RDONLY shouldn't be needed. So stop doing that. Link: https://lore.kernel.org/all/000000000000b90a8e061e21d12f@google.com Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://patch.msgid.link/20240805201241.27286-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-26ext4: nested locking for xattr inodeWojciech Gładysz
Add nested locking with I_MUTEX_XATTR subclass to avoid lockdep warning while handling xattr inode on file open syscall at ext4_xattr_inode_iget. Backtrace EXT4-fs (loop0): Ignoring removed oldalloc option ====================================================== WARNING: possible circular locking dependency detected 5.10.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor543/2794 is trying to acquire lock: ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline] ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425 but task is already holding lock: ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&ei->i_data_sem/3){++++}-{3:3}: lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566 down_write+0x93/0x180 kernel/locking/rwsem.c:1564 ext4_update_i_disksize fs/ext4/ext4.h:3267 [inline] ext4_xattr_inode_write fs/ext4/xattr.c:1390 [inline] ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1538 [inline] ext4_xattr_set_entry+0x331a/0x3d80 fs/ext4/xattr.c:1662 ext4_xattr_ibody_set+0x124/0x390 fs/ext4/xattr.c:2228 ext4_xattr_set_handle+0xc27/0x14e0 fs/ext4/xattr.c:2385 ext4_xattr_set+0x219/0x390 fs/ext4/xattr.c:2498 ext4_xattr_user_set+0xc9/0xf0 fs/ext4/xattr_user.c:40 __vfs_setxattr+0x404/0x450 fs/xattr.c:177 __vfs_setxattr_noperm+0x11d/0x4f0 fs/xattr.c:208 __vfs_setxattr_locked+0x1f9/0x210 fs/xattr.c:266 vfs_setxattr+0x112/0x2c0 fs/xattr.c:283 setxattr+0x1db/0x3e0 fs/xattr.c:548 path_setxattr+0x15a/0x240 fs/xattr.c:567 __do_sys_setxattr fs/xattr.c:582 [inline] __se_sys_setxattr fs/xattr.c:578 [inline] __x64_sys_setxattr+0xc5/0xe0 fs/xattr.c:578 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62 entry_SYSCALL_64_after_hwframe+0x61/0xcb -> #0 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:2988 [inline] check_prevs_add kernel/locking/lockdep.c:3113 [inline] validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729 __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955 lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566 down_write+0x93/0x180 kernel/locking/rwsem.c:1564 inode_lock include/linux/fs.h:782 [inline] ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425 ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485 ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline] ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline] ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774 __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898 ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline] __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018 ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562 notify_change+0xbb6/0xe60 fs/attr.c:435 do_truncate+0x1de/0x2c0 fs/open.c:64 handle_truncate fs/namei.c:2970 [inline] do_open fs/namei.c:3311 [inline] path_openat+0x29f3/0x3290 fs/namei.c:3425 do_filp_open+0x20b/0x450 fs/namei.c:3452 do_sys_openat2+0x124/0x460 fs/open.c:1207 do_sys_open fs/open.c:1223 [inline] __do_sys_open fs/open.c:1231 [inline] __se_sys_open fs/open.c:1227 [inline] __x64_sys_open+0x221/0x270 fs/open.c:1227 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62 entry_SYSCALL_64_after_hwframe+0x61/0xcb other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_data_sem/3); lock(&ea_inode->i_rwsem#7/1); lock(&ei->i_data_sem/3); lock(&ea_inode->i_rwsem#7/1); *** DEADLOCK *** 5 locks held by syz-executor543/2794: #0: ffff888026fbc448 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write+0x4a/0x2a0 fs/namespace.c:365 #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline] #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: do_truncate+0x1cf/0x2c0 fs/open.c:62 #2: ffff8880215e3310 (&ei->i_mmap_sem){++++}-{3:3}, at: ext4_setattr+0xec4/0x19c0 fs/ext4/inode.c:5519 #3: ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_write_trylock_xattr fs/ext4/xattr.h:162 [inline] #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_try_to_expand_extra_isize fs/ext4/inode.c:5938 [inline] #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: __ext4_mark_inode_dirty+0x4fb/0x810 fs/ext4/inode.c:6018 stack backtrace: CPU: 1 PID: 2794 Comm: syz-executor543 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x177/0x211 lib/dump_stack.c:118 print_circular_bug+0x146/0x1b0 kernel/locking/lockdep.c:2002 check_noncircular+0x2cc/0x390 kernel/locking/lockdep.c:2123 check_prev_add kernel/locking/lockdep.c:2988 [inline] check_prevs_add kernel/locking/lockdep.c:3113 [inline] validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729 __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955 lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566 down_write+0x93/0x180 kernel/locking/rwsem.c:1564 inode_lock include/linux/fs.h:782 [inline] ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425 ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485 ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline] ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline] ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774 __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898 ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline] __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018 ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562 notify_change+0xbb6/0xe60 fs/attr.c:435 do_truncate+0x1de/0x2c0 fs/open.c:64 handle_truncate fs/namei.c:2970 [inline] do_open fs/namei.c:3311 [inline] path_openat+0x29f3/0x3290 fs/namei.c:3425 do_filp_open+0x20b/0x450 fs/namei.c:3452 do_sys_openat2+0x124/0x460 fs/open.c:1207 do_sys_open fs/open.c:1223 [inline] __do_sys_open fs/open.c:1231 [inline] __se_sys_open fs/open.c:1227 [inline] __x64_sys_open+0x221/0x270 fs/open.c:1227 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f0cde4ea229 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd81d1c978 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 RAX: ffffffffffffffda RBX: 0030656c69662f30 RCX: 00007f0cde4ea229 RDX: 0000000000000089 RSI: 00000000000a0a00 RDI: 00000000200001c0 RBP: 2f30656c69662f2e R08: 0000000000208000 R09: 0000000000208000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd81d1c9c0 R13: 00007ffd81d1ca00 R14: 0000000000080000 R15: 0000000000000003 EXT4-fs error (device loop0): ext4_expand_extra_isize_ea:2730: inode #13: comm syz-executor543: corrupted in-inode xattr Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com> Link: https://patch.msgid.link/20240801143827.19135-1-wojciech.gladysz@infogain.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-26jbd2: remove unneeded check of ret in jbd2_fc_get_bufKemeng Shi
Simply return -EINVAL if j_fc_off is invalid to avoid repeated check of ret. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240801013815.2393869-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-26jbd2: correct comment jbd2_mark_journal_emptyKemeng Shi
After jbd2_mark_journal_empty, journal log is supposed to be empty. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240801013815.2393869-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-26jbd2: move escape handle to futher improve jbd2_journal_write_metadata_bufferKemeng Shi
Move escape handle to futher improve code readability and remove some repeat check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240801013815.2393869-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>