summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-03-03xfs: define the zoned on-disk formatChristoph Hellwig
Zone file systems reuse the basic RT group enabled XFS file system structure to support a mode where each RT group is always written from start to end and then reset for reuse (after moving out any remaining data). There are few minor but important changes, which are indicated by a new incompat flag: 1) there are no bitmap and summary inodes, thus the /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the sb_rbmblocks superblock field must be cleared to zero. 2) there is a new superblock field that specifies the start of an internal RT section. This allows supporting SMR HDDs that have random writable space at the beginning which is used for the XFS data device (which really is the metadata device for this configuration), directly followed by a RT device on the same block device. While something similar could be achieved using dm-linear just having a single device directly consumed by XFS makes handling the file systems a lot easier. 3) Another superblock field that tracks the amount of reserved space (or overprovisioning) that is never used for user capacity, but allows GC to run more smoothly. 4) an overlay of the cowextsize field for the rtrmap inode so that we can persistently track the total amount of rtblocks currently used in a RT group. There is no data structure other than the rmap that tracks used space in an RT group, and this counter is used to decide when a RT group has been entirely emptied, and to select one that is relatively empty if garbage collection needs to be performed. While this counter could be tracked entirely in memory and rebuilt from the rmap at mount time, that would lead to very long mount times with the large number of RT groups implied by the number of hardware zones especially on SMR hard drives with 256MB zone sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add a xfs_rtrmap_highest_rgbno helperChristoph Hellwig
Add a helper to find the last offset mapped in the rtrmap. This will be used by the zoned code to find out where to start writing again on conventional devices without hardware zone support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delayChristoph Hellwig
The zone allocator wants to be able to remove a delalloc mapping in the COW fork while keeping the block reservation. To support that pass the flags argument down to xfs_bmap_del_extent_delay and support the XFS_BMAPI_REMAP flag to keep the reservation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: refine the unaligned check for always COW inodes in xfs_file_dio_writeChristoph Hellwig
For always COW inodes we also must check the alignment of each individual iovec segment, as they could end up with different I/Os due to the way bio_iov_iter_get_pages works, and we'd then overwrite an already written block. The existing always_cow sysctl based code doesn't catch this because nothing enforces that blocks aren't rewritten, but for zoned XFS on sequential write required zones this is a hard error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: skip always_cow inodes in xfs_reflink_trim_around_sharedChristoph Hellwig
xfs_reflink_trim_around_shared tries to find shared blocks in the refcount btree. Always_cow inodes don't have that tree, so don't bother. For the existing always_cow code this is a minor optimization. For the upcoming zoned code that can do COW without the rtreflink code it avoids triggering a NULL pointer dereference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.cChristoph Hellwig
Delalloc reservations are not supported in userspace, and thus it doesn't make sense to share this helper with xfsprogs.c. Move it to xfs_iomap.c toward the two callers. Note that there rest of the delalloc handling should probably eventually also move out of xfs_bmap.c, but that will require a bit more surgery. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add a rtg_blocks helperChristoph Hellwig
Shortcut dereferencing the xg_block_count field in the generic group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: factor out a xfs_rt_check_size helperChristoph Hellwig
Add a helper to check that the last block of a RT device is readable to share the code between mount and growfs. This also adds the mount time overflow check to growfs and improves the error messages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: reduce metafile reservationsChristoph Hellwig
There is no point in reserving more space than actually available on the data device for the worst case scenario that is unlikely to happen. Reserve at most 1/4th of the data device blocks, which is still a heuristic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: make metabtree reservations globalChristoph Hellwig
Currently each metabtree inode has it's own space reservation to ensure it can be expanded to the maximum size, mirroring what is done for the AG-based btrees. But unlike the AG-based btrees the metabtree inodes aren't restricted to allocate from a single AG but can use free space form the entire file system. And unlike AG-based btrees where the required reservation shrinks with the available free space due to this, the metabtree reservations for the rtrmap and rtfreflink trees are not bound in any way by the data device free space as they track RT extent allocations. This is not very efficient as it requires a large number of blocks to be set aside that can't be used at all by other btrees. Switch to a model that uses a global pool instead in preparation for reducing the amount of reserved space, which now also removes the overloading of the i_nblocks field for metabtree inodes, which would create problems if metabtree inodes ever had a big enough xattr fork to require xattr blocks outside the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: fixup the metabtree reservation in xrep_reap_metadir_fsblocksChristoph Hellwig
All callers of xrep_reap_metadir_fsblocks need to fix up the metabtree reservation, otherwise they'd leave the reservations in an incoherent state. Move the call to xrep_reset_metafile_resv into xrep_reap_metadir_fsblocks so it always is taken care of, and remove now superfluous helper functions in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: trace in-memory freecounter reservationsChristoph Hellwig
Add two tracepoints when the freecounter dips into the reserved pool and when it is entirely out of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: support reserved blocks for the rt extent counterChristoph Hellwig
The zoned space allocator will need reserved RT extents for garbage collection and zeroing of partial blocks. Move the resblks related fields into the freecounter array so that they can be used for all counters. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: generalize the freespace and reserved blocks handlingChristoph Hellwig
xfs_{add,dec}_freecounter already handles the block and RT extent percpu counters, but it currently hardcodes the passed in counter. Add a freecounter abstraction that uses an enum to designate the counter and add wrappers that hide the actual percpu_counters. This will allow expanding the reserved block handling to the RT extent counter in the next step, and also prepares for adding yet another such counter that can share the code. Both these additions will be needed for the zoned allocator. Also switch the flooring of the frextents counter to 0 in statfs for the rthinherit case to a manual min_t call to match the handling of the fdblocks counter for normal file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: reflow xfs_dec_freecounterChristoph Hellwig
Let the successful allocation be the main path through the function with exception handling in branches to make the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03Merge branch 'vfs-6.15.iomap' of ↵Carlos Maiolino
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs into xfs-6.15-merge
2025-03-02cifs: fix incorrect validation for num_aces field of smb_aclNamjae Jeon
parse_dcal() validate num_aces to allocate ace array. f (num_aces > ULONG_MAX / sizeof(struct smb_ace *)) It is an incorrect validation that we can create an array of size ULONG_MAX. smb_acl has ->size field to calculate actual number of aces in response buffer size. Use this to check invalid num_aces. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-02ksmbd: fix incorrect validation for num_aces field of smb_aclNamjae Jeon
parse_dcal() validate num_aces to allocate posix_ace_state_array. if (num_aces > ULONG_MAX / sizeof(struct smb_ace *)) It is an incorrect validation that we can create an array of size ULONG_MAX. smb_acl has ->size field to calculate actual number of aces in request buffer size. Use this to check invalid num_aces. Reported-by: Igor Leite Ladessa <igor-ladessa@hotmail.com> Tested-by: Igor Leite Ladessa <igor-ladessa@hotmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-02smb: common: change the data type of num_aces to le16Namjae Jeon
2.4.5 in [MS-DTYP].pdf describe the data type of num_aces as le16. AceCount (2 bytes): An unsigned 16-bit integer that specifies the count of the number of ACE records in the ACL. Change it to le16 and add reserved field to smb_acl struct. Reported-by: Igor Leite Ladessa <igor-ladessa@hotmail.com> Tested-by: Igor Leite Ladessa <igor-ladessa@hotmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-02ksmbd: fix bug on trap in smb2_lockNamjae Jeon
If lock count is greater than 1, flags could be old value. It should be checked with flags of smb_lock, not flags. It will cause bug-on trap from locks_free_lock in error handling routine. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-02ksmbd: fix use-after-free in smb2_lockNamjae Jeon
If smb_lock->zero_len has value, ->llist of smb_lock is not delete and flock is old one. It will cause use-after-free on error handling routine. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-02ksmbd: fix type confusion via race condition when using ipc_msg_send_requestNamjae Jeon
req->handle is allocated using ksmbd_acquire_id(&ipc_ida), based on ida_alloc. req->handle from ksmbd_ipc_login_request and FSCTL_PIPE_TRANSCEIVE ioctl can be same and it could lead to type confusion between messages, resulting in access to unexpected parts of memory after an incorrect delivery. ksmbd check type of ipc response but missing add continue to check next ipc reponse. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-02ksmbd: fix out-of-bounds in parse_sec_desc()Namjae Jeon
If osidoffset, gsidoffset and dacloffset could be greater than smb_ntsd struct size. If it is smaller, It could cause slab-out-of-bounds. And when validating sid, It need to check it included subauth array size. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-03-01Merge tag 'v6.14-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fix from Steve French: "Fix SMB1 netfs client regression" * tag 'v6.14-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix the smb1 readv callback to correctly call netfs
2025-03-01ecryptfs: remove NULL remount_fs from super_operationsEric Sandeen
This got missed during the mount API conversion. This makes no functional difference, but after we convert the last filesystem, we'll want to remove the remount_fs op from super_operations altogether. So let's just get this out of the way now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/5dc2eb45-7126-4777-a7f9-29d02dff443f@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28dlm: fix error if active rsb is not hashedAlexander Aring
If an active rsb is not hashed anymore and this could occur because we releases and acquired locks we need to signal the followed code that the lookup failed. Since the lookup was successful, but it isn't part of the rsb hash anymore we need to signal it by setting error to -EBADR as dlm_search_rsb_tree() does it. Cc: stable@vger.kernel.org Fixes: 5be323b0c64d ("dlm: move dlm_search_rsb_tree() out of lock") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2025-02-28dlm: fix error if inactive rsb is not hashedAlexander Aring
If an inactive rsb is not hashed anymore and this could occur because we releases and acquired locks we need to signal the followed code that the lookup failed. Since the lookup was successful, but it isn't part of the rsb hash anymore we need to signal it by setting error to -EBADR as dlm_search_rsb_tree() does it. Cc: stable@vger.kernel.org Fixes: 01fdeca1cc2d ("dlm: use rcu to avoid an extra rsb struct lookup") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2025-02-28bcachefs: Don't set BCH_FEATURE_incompat_version_field unless requestedKent Overstreet
We shouldn't be setting incompatible bits or the incompatible version field unless explicitly request or allowed - otherwise we break mounting with old kernels or userspace. Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-28Merge tag 'efi-fixes-for-v6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "Another couple of EFI fixes for v6.14. Only James's patch stands out, as it implements a workaround for odd behavior in fwupd in user space, which creates EFI variables by touching a file in efivarfs, clearing the immutable bit (which gets set automatically for $reasons) and then opening it again for writing, none of which is really necessary. The fwupd author and LVFS maintainer is already rolling out a fix for this on the fwupd side, and suggested that the workaround in this PR could be backed out again during the next cycle. (There is a semantic mismatch in efivarfs where some essential variable attributes are stored in the first 4 bytes of the file, and so zero length files cannot exist, as they cannot be written back to the underlying variable store. So now, they are dropped once the last reference is released.) Summary: - Fix CPER error record parsing bugs - Fix a couple of efivarfs issues that were introduced in the merge window - Fix an issue in the early remapping code of the MOKvar table" * tag 'efi-fixes-for-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi/mokvar-table: Avoid repeated map/unmap of the same page efi: Don't map the entire mokvar table to determine its size efivarfs: allow creation of zero length files efivarfs: Defer PM notifier registration until .fill_super efi/cper: Fix cper_arm_ctx_info alignment efi/cper: Fix cper_ia_proc_ctx alignment
2025-02-28f2fs: add check for deleted inodeLeo Stone
The syzbot reproducer mounts a f2fs image, then tries to unlink an existing file. However, the unlinked file already has a link count of 0 when it is read for the first time in do_read_inode(). Add a check to sanity_check_inode() for i_nlink == 0. [Chao Yu: rebase the code and fix orphan inode recovery issue] Reported-by: syzbot+b01a36acd7007e273a83@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b01a36acd7007e273a83 Fixes: 39a53e0ce0df ("f2fs: add superblock and major in-memory structure") Signed-off-by: Leo Stone <leocstone@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-02-28f2fs: fix the missing write pointer correctionJaegeuk Kim
If checkpoint was disabled, we missed to fix the write pointers. Cc: <stable@vger.kernel.org> Fixes: 1015035609e4 ("f2fs: fix changing cursegs if recovery fails on zoned device") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-02-28f2fs: fix to set .discard_granularity correctlyChao Yu
commit 4f993264fe29 ("f2fs: introduce discard_unit mount option") introduced a bug, when we enable discard_unit=section option, it will set .discard_granularity to BLKS_PER_SEC(), however discard granularity only supports [1, 512], once section size is not equal to segment size, it will cause issue_discard_thread() in DPOLICY_BG mode will not select discard entry w/ any granularity to issue. Fixes: 4f993264fe29 ("f2fs: introduce discard_unit mount option") Reviewed-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Yohan Joung <yohan.joung@sk.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-02-28ceph: Pass a folio to ceph_allocate_page_array()Matthew Wilcox (Oracle)
Remove two accesses to page->index. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-10-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_move_dirty_page_in_page_array() to ↵Matthew Wilcox (Oracle)
move_dirty_folio_in_page_array() Shorten the name of this internal function by dropping the 'ceph_' prefix and pass in a folio instead of a page. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-9-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Remove uses of page from ceph_process_folio_batch()Matthew Wilcox (Oracle)
Remove uses of page->index and deprecated page APIs. Saves a lot of hidden calls to compound_head(). Also convert is_page_index_contiguous() to is_folio_index_contiguous() and make its arguments const. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-8-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_check_page_before_write() to use a folioMatthew Wilcox (Oracle)
Remove the conversion back to a struct page and just use the folio passed in. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-7-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert writepage_nounlock() to write_folio_nounlock()Matthew Wilcox (Oracle)
Remove references to page->index, page->mapping, thp_size(), page_offset() and other page APIs in favour of their more efficient folio replacements. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-6-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_readdir_cache_control to store a folioMatthew Wilcox (Oracle)
Pass a folio around instead of a page. This removes an access to page->index and a few hidden calls to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-5-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_find_incompatible() to take a folioMatthew Wilcox (Oracle)
Both callers already have the folio. Pass it in and use it throughout. Removes some hidden calls to compound_head() and a reference to page->mapping. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-4-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Use a folio in ceph_page_mkwrite()Matthew Wilcox (Oracle)
Convert the passed page to a folio and use it throughout ceph_page_mkwrite(). Removes the last call to page_mkwrite_check_truncate(), the last call to offset_in_thp() and one of the last calls to thp_size(). Saves a few calls to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-3-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Remove ceph_writepage()Matthew Wilcox (Oracle)
Ceph already has a writepages operation which is preferred over writepage in all situations except for page migration. By adding a migrate_folio operation, there will be no situations in which ->writepage should be called. filemap_migrate_folio() is an appropriate operation to use because the ceph data stored in folio->private does not contain any reference to the memory address of the folio. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-2-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: fix generic/421 test failureViacheslav Dubeyko
The generic/421 fails to finish because of the issue: Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds. Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24) Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace: Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK> There are several issues here: (1) ceph_kill_sb() doesn't wait ending of flushing all dirty folios/pages because of racy nature of mdsc->stopping_blockers. As a result, mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early. (2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment mdsc->stopping_blockers. Finally, already locked folios/pages are never been unlocked and the logic tries to lock the same page second time. (3) The folio_batch with found dirty pages by filemap_get_folios_tag() is not processed properly. And this is why some number of dirty pages simply never processed and we have dirty folios/pages after unmount anyway. This patch fixes the issues by means of: (1) introducing dirty_folios counter and flush_end_wq waiting queue in struct ceph_mds_client; (2) ceph_dirty_folio() increments the dirty_folios counter; (3) writepages_finish() decrements the dirty_folios counter and wake up all waiters on the queue if dirty_folios counter is equal or lesser than zero; (4) adding in ceph_kill_sb() method the logic of checking the value of dirty_folios counter and waiting if it is bigger than zero; (5) adding ceph_inc_osd_stopping_blocker() call in the beginning of the ceph_writepages_start() and ceph_dec_osd_stopping_blocker() at the end of the ceph_writepages_start() with the goal to resolve the racy nature of mdsc->stopping_blockers. sudo ./check generic/421 FSTYP -- ceph PLATFORM -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb 3 20:30:08 UTC 2025 MKFS_OPTIONS -- 127.0.0.1:40551:/scratch MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch generic/421 7s ... 4s Ran: generic/421 Passed all 1 tests Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-5-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: introduce ceph_submit_write() methodViacheslav Dubeyko
Final responsibility of ceph_writepages_start() is to submit write requests for processed dirty folios/pages. The ceph_submit_write() summarize all this logic in one method. The generic/421 fails to finish because of the issue: Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds. Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24) Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace: Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30 Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK> There are two problems here: if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { rc = -EIO; goto release_folios; } (1) ceph_kill_sb() doesn't wait ending of flushing all dirty folios/pages because of racy nature of mdsc->stopping_blockers. As a result, mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early. (2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment mdsc->stopping_blockers. Finally, already locked folios/pages are never been unlocked and the logic tries to lock the same page second time. This patch implements refactoring of ceph_submit_write() and also it solves the second issue. Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-4-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: introduce ceph_process_folio_batch() methodViacheslav Dubeyko
First step of ceph_writepages_start() logic is of finding the dirty memory folios and processing it. This patch introduces ceph_process_folio_batch() method that moves this logic into dedicated method. The ceph_writepages_start() has this logic: if (ceph_wbc.locked_pages == 0) lock_page(page); /* first page */ else if (!trylock_page(page)) break; <skipped> if (folio_test_writeback(folio) || folio_test_private_2(folio) /* [DEPRECATED] */) { if (wbc->sync_mode == WB_SYNC_NONE) { doutc(cl, "%p under writeback\n", folio); folio_unlock(folio); continue; } doutc(cl, "waiting on writeback %p\n", folio); folio_wait_writeback(folio); folio_wait_private_2(folio); /* [DEPRECATED] */ } The problem here that folio/page is locked here at first and it is by set_page_writeback(page) later before submitting the write request. The folio/page is unlocked by writepages_finish() after finishing the write request. It means that logic of checking folio_test_writeback() and folio_wait_writeback() never works because page is locked and it cannot be locked again until write request completion. However, for majority of folios/pages the trylock_page() is used. As a result, multiple threads can try to lock the same folios/pages multiple times even if they are under writeback already. It makes this logic more compute intensive than it is necessary. This patch changes this logic: if (folio_test_writeback(folio) || folio_test_private_2(folio) /* [DEPRECATED] */) { if (wbc->sync_mode == WB_SYNC_NONE) { doutc(cl, "%p under writeback\n", folio); folio_unlock(folio); continue; } doutc(cl, "waiting on writeback %p\n", folio); folio_wait_writeback(folio); folio_wait_private_2(folio); /* [DEPRECATED] */ } if (ceph_wbc.locked_pages == 0) lock_page(page); /* first page */ else if (!trylock_page(page)) break; This logic should exclude the ignoring of writeback state of folios/pages. Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-3-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoringViacheslav Dubeyko
The ceph_writepages_start() has unreasonably huge size and complex logic that makes this method hard to understand. Current state of the method's logic makes bug fix really hard task. This patch extends the struct ceph_writeback_ctl with the goal to make ceph_writepages_start() method more compact and easy to understand by means of deep refactoring. Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-2-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27ceph: return the correct dentry on mkdirNeilBrown
ceph already splices the correct dentry (in splice_dentry()) from the result of mkdir but does nothing more with it. Now that ->mkdir can return a dentry, return the correct dentry. Note that previously ceph_mkdir() could call ceph_init_inode_acls() on the inode from the wrong dentry, which would be NULL. This is safe as ceph_init_inode_acls() checks for NULL, but is not strictly correct. With this patch, the inode for the returned dentry is passed to ceph_init_inode_acls(). Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-4-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27hostfs: store inode in dentry after mkdir if possible.NeilBrown
After handling a mkdir, get the inode for the name and use d_splice_alias() to store the correct dentry in the dcache. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-3-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27Change inode_operations.mkdir to return struct dentry *NeilBrown
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27Merge tag 'net-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth. We didn't get netfilter or wireless PRs this week, so next week's PR is probably going to be bigger. A healthy dose of fixes for bugs introduced in the current release nonetheless. Current release - regressions: - Bluetooth: always allow SCO packets for user channel - af_unix: fix memory leak in unix_dgram_sendmsg() - rxrpc: - remove redundant peer->mtu_lock causing lockdep splats - fix spinlock flavor issues with the peer record hash - eth: iavf: fix circular lock dependency with netdev_lock - net: use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net() RDMA driver register notifier after the device Current release - new code bugs: - ethtool: fix ioctl confusing drivers about desired HDS user config - eth: ixgbe: fix media cage present detection for E610 device Previous releases - regressions: - loopback: avoid sending IP packets without an Ethernet header - mptcp: reset connection when MPTCP opts are dropped after join Previous releases - always broken: - net: better track kernel sockets lifetime - ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels - phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT - eth: enetc: number of error handling fixes - dsa: rtl8366rb: reshuffle the code to fix config / build issue with LED support" * tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits) net: ti: icss-iep: Reject perout generation request idpf: fix checksums set in idpf_rx_rsc() selftests: drv-net: Check if combined-count exists net: ipv6: fix dst ref loop on input in rpl lwt net: ipv6: fix dst ref loop on input in seg6 lwt usbnet: gl620a: fix endpoint checking in genelink_bind() net/mlx5: IRQ, Fix null string in debug print net/mlx5: Restore missing trace event when enabling vport QoS net/mlx5: Fix vport QoS cleanup on error net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination. af_unix: Fix memory leak in unix_dgram_sendmsg() net: Handle napi_schedule() calls from non-interrupt net: Clear old fragment checksum value in napi_reuse_skb gve: unlink old napi when stopping a queue using queue API net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net(). tcp: Defer ts_recent changes until req is owned net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs() net: enetc: remove the mm_lock from the ENETC v4 driver net: enetc: add missing enetc4_link_deinit() net: enetc: update UDP checksum when updating originTimestamp field ...