Age | Commit message (Collapse) | Author |
|
Zone file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data). There are few minor but important changes, which are indicated
by a new incompat flag:
1) there are no bitmap and summary inodes, thus the
/rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
sb_rbmblocks superblock field must be cleared to zero.
2) there is a new superblock field that specifies the start of an
internal RT section. This allows supporting SMR HDDs that have random
writable space at the beginning which is used for the XFS data device
(which really is the metadata device for this configuration), directly
followed by a RT device on the same block device. While something
similar could be achieved using dm-linear just having a single device
directly consumed by XFS makes handling the file systems a lot easier.
3) Another superblock field that tracks the amount of reserved space (or
overprovisioning) that is never used for user capacity, but allows GC
to run more smoothly.
4) an overlay of the cowextsize field for the rtrmap inode so that we
can persistently track the total amount of rtblocks currently used in
a RT group. There is no data structure other than the rmap that
tracks used space in an RT group, and this counter is used to decide
when a RT group has been entirely emptied, and to select one that
is relatively empty if garbage collection needs to be performed.
While this counter could be tracked entirely in memory and rebuilt
from the rmap at mount time, that would lead to very long mount times
with the large number of RT groups implied by the number of hardware
zones especially on SMR hard drives with 256MB zone sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Add a helper to find the last offset mapped in the rtrmap. This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block. The existing always_cow sysctl based code doesn't catch this
because nothing enforces that blocks aren't rewritten, but for zoned XFS
on sequential write required zones this is a hard error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree. Always_cow inodes don't have that tree, so don't
bother.
For the existing always_cow code this is a minor optimization. For
the upcoming zoned code that can do COW without the rtreflink code it
avoids triggering a NULL pointer dereference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.c. Move it to xfs_iomap.c
toward the two callers.
Note that there rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Shortcut dereferencing the xg_block_count field in the generic group
structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs. This also adds the mount
time overflow check to growfs and improves the error messages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen. Reserve at most 1/4th of the data device blocks, which is
still a heuristic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Currently each metabtree inode has it's own space reservation to ensure
it can be expanded to the maximum size, mirroring what is done for the
AG-based btrees. But unlike the AG-based btrees the metabtree inodes
aren't restricted to allocate from a single AG but can use free space
form the entire file system. And unlike AG-based btrees where the
required reservation shrinks with the available free space due to this,
the metabtree reservations for the rtrmap and rtfreflink trees are not
bound in any way by the data device free space as they track RT extent
allocations. This is not very efficient as it requires a large number
of blocks to be set aside that can't be used at all by other btrees.
Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space, which now also removes the
overloading of the i_nblocks field for metabtree inodes, which would
create problems if metabtree inodes ever had a big enough xattr fork
to require xattr blocks outside the inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
All callers of xrep_reap_metadir_fsblocks need to fix up the metabtree
reservation, otherwise they'd leave the reservations in an incoherent
state. Move the call to xrep_reset_metafile_resv into
xrep_reap_metadir_fsblocks so it always is taken care of, and remove
now superfluous helper functions in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Add two tracepoints when the freecounter dips into the reserved pool
and when it is entirely out of space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
The zoned space allocator will need reserved RT extents for garbage
collection and zeroing of partial blocks. Move the resblks related
fields into the freecounter array so that they can be used for all
counters.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.
Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters. This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code. Both these additions will be needed for the zoned
allocator.
Also switch the flooring of the frextents counter to 0 in statfs for the
rthinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
Let the successful allocation be the main path through the function
with exception handling in branches to make the code easier to
follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs into xfs-6.15-merge
|
|
parse_dcal() validate num_aces to allocate ace array.
f (num_aces > ULONG_MAX / sizeof(struct smb_ace *))
It is an incorrect validation that we can create an array of size ULONG_MAX.
smb_acl has ->size field to calculate actual number of aces in response buffer
size. Use this to check invalid num_aces.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
parse_dcal() validate num_aces to allocate posix_ace_state_array.
if (num_aces > ULONG_MAX / sizeof(struct smb_ace *))
It is an incorrect validation that we can create an array of size ULONG_MAX.
smb_acl has ->size field to calculate actual number of aces in request buffer
size. Use this to check invalid num_aces.
Reported-by: Igor Leite Ladessa <igor-ladessa@hotmail.com>
Tested-by: Igor Leite Ladessa <igor-ladessa@hotmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
2.4.5 in [MS-DTYP].pdf describe the data type of num_aces as le16.
AceCount (2 bytes): An unsigned 16-bit integer that specifies the count
of the number of ACE records in the ACL.
Change it to le16 and add reserved field to smb_acl struct.
Reported-by: Igor Leite Ladessa <igor-ladessa@hotmail.com>
Tested-by: Igor Leite Ladessa <igor-ladessa@hotmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If lock count is greater than 1, flags could be old value.
It should be checked with flags of smb_lock, not flags.
It will cause bug-on trap from locks_free_lock in error handling
routine.
Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If smb_lock->zero_len has value, ->llist of smb_lock is not delete and
flock is old one. It will cause use-after-free on error handling
routine.
Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
req->handle is allocated using ksmbd_acquire_id(&ipc_ida), based on
ida_alloc. req->handle from ksmbd_ipc_login_request and
FSCTL_PIPE_TRANSCEIVE ioctl can be same and it could lead to type confusion
between messages, resulting in access to unexpected parts of memory after
an incorrect delivery. ksmbd check type of ipc response but missing add
continue to check next ipc reponse.
Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If osidoffset, gsidoffset and dacloffset could be greater than smb_ntsd
struct size. If it is smaller, It could cause slab-out-of-bounds.
And when validating sid, It need to check it included subauth array size.
Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull smb client fix from Steve French:
"Fix SMB1 netfs client regression"
* tag 'v6.14-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix the smb1 readv callback to correctly call netfs
|
|
This got missed during the mount API conversion. This makes no functional
difference, but after we convert the last filesystem, we'll want to remove
the remount_fs op from super_operations altogether. So let's just get this
out of the way now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/5dc2eb45-7126-4777-a7f9-29d02dff443f@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If an active rsb is not hashed anymore and this could occur because we
releases and acquired locks we need to signal the followed code that
the lookup failed. Since the lookup was successful, but it isn't part of
the rsb hash anymore we need to signal it by setting error to -EBADR as
dlm_search_rsb_tree() does it.
Cc: stable@vger.kernel.org
Fixes: 5be323b0c64d ("dlm: move dlm_search_rsb_tree() out of lock")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
If an inactive rsb is not hashed anymore and this could occur because we
releases and acquired locks we need to signal the followed code that the
lookup failed. Since the lookup was successful, but it isn't part of the
rsb hash anymore we need to signal it by setting error to -EBADR as
dlm_search_rsb_tree() does it.
Cc: stable@vger.kernel.org
Fixes: 01fdeca1cc2d ("dlm: use rcu to avoid an extra rsb struct lookup")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
We shouldn't be setting incompatible bits or the incompatible version
field unless explicitly request or allowed - otherwise we break mounting
with old kernels or userspace.
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"Another couple of EFI fixes for v6.14.
Only James's patch stands out, as it implements a workaround for odd
behavior in fwupd in user space, which creates EFI variables by
touching a file in efivarfs, clearing the immutable bit (which gets
set automatically for $reasons) and then opening it again for writing,
none of which is really necessary.
The fwupd author and LVFS maintainer is already rolling out a fix for
this on the fwupd side, and suggested that the workaround in this PR
could be backed out again during the next cycle.
(There is a semantic mismatch in efivarfs where some essential
variable attributes are stored in the first 4 bytes of the file, and
so zero length files cannot exist, as they cannot be written back to
the underlying variable store. So now, they are dropped once the last
reference is released.)
Summary:
- Fix CPER error record parsing bugs
- Fix a couple of efivarfs issues that were introduced in the merge
window
- Fix an issue in the early remapping code of the MOKvar table"
* tag 'efi-fixes-for-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/mokvar-table: Avoid repeated map/unmap of the same page
efi: Don't map the entire mokvar table to determine its size
efivarfs: allow creation of zero length files
efivarfs: Defer PM notifier registration until .fill_super
efi/cper: Fix cper_arm_ctx_info alignment
efi/cper: Fix cper_ia_proc_ctx alignment
|
|
The syzbot reproducer mounts a f2fs image, then tries to unlink an
existing file. However, the unlinked file already has a link count of 0
when it is read for the first time in do_read_inode().
Add a check to sanity_check_inode() for i_nlink == 0.
[Chao Yu: rebase the code and fix orphan inode recovery issue]
Reported-by: syzbot+b01a36acd7007e273a83@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b01a36acd7007e273a83
Fixes: 39a53e0ce0df ("f2fs: add superblock and major in-memory structure")
Signed-off-by: Leo Stone <leocstone@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If checkpoint was disabled, we missed to fix the write pointers.
Cc: <stable@vger.kernel.org>
Fixes: 1015035609e4 ("f2fs: fix changing cursegs if recovery fails on zoned device")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
commit 4f993264fe29 ("f2fs: introduce discard_unit mount option") introduced
a bug, when we enable discard_unit=section option, it will set
.discard_granularity to BLKS_PER_SEC(), however discard granularity only
supports [1, 512], once section size is not equal to segment size, it will
cause issue_discard_thread() in DPOLICY_BG mode will not select discard entry
w/ any granularity to issue.
Fixes: 4f993264fe29 ("f2fs: introduce discard_unit mount option")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Yohan Joung <yohan.joung@sk.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Remove two accesses to page->index.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-10-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
move_dirty_folio_in_page_array()
Shorten the name of this internal function by dropping the 'ceph_'
prefix and pass in a folio instead of a page.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-9-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove uses of page->index and deprecated page APIs. Saves a lot of
hidden calls to compound_head().
Also convert is_page_index_contiguous() to is_folio_index_contiguous()
and make its arguments const.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-8-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove the conversion back to a struct page and just use the folio
passed in.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-7-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove references to page->index, page->mapping, thp_size(),
page_offset() and other page APIs in favour of their more efficient
folio replacements.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-6-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pass a folio around instead of a page. This removes an access to
page->index and a few hidden calls to compound_head().
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-5-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Both callers already have the folio. Pass it in and use it throughout.
Removes some hidden calls to compound_head() and a reference to
page->mapping.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-4-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert the passed page to a folio and use it
throughout ceph_page_mkwrite(). Removes the last call to
page_mkwrite_check_truncate(), the last call to offset_in_thp() and one
of the last calls to thp_size(). Saves a few calls to compound_head().
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-3-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Ceph already has a writepages operation which is preferred over writepage
in all situations except for page migration. By adding a migrate_folio
operation, there will be no situations in which ->writepage should
be called. filemap_migrate_folio() is an appropriate operation to use
because the ceph data stored in folio->private does not contain any
reference to the memory address of the folio.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250217185119.430193-2-willy@infradead.org
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The generic/421 fails to finish because of the issue:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>
There are several issues here:
(1) ceph_kill_sb() doesn't wait ending of flushing
all dirty folios/pages because of racy nature of
mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails
to increment mdsc->stopping_blockers. Finally,
already locked folios/pages are never been unlocked
and the logic tries to lock the same page second time.
(3) The folio_batch with found dirty pages by
filemap_get_folios_tag() is not processed properly.
And this is why some number of dirty pages simply never
processed and we have dirty folios/pages after unmount
anyway.
This patch fixes the issues by means of:
(1) introducing dirty_folios counter and flush_end_wq
waiting queue in struct ceph_mds_client;
(2) ceph_dirty_folio() increments the dirty_folios
counter;
(3) writepages_finish() decrements the dirty_folios
counter and wake up all waiters on the queue
if dirty_folios counter is equal or lesser than zero;
(4) adding in ceph_kill_sb() method the logic of
checking the value of dirty_folios counter and
waiting if it is bigger than zero;
(5) adding ceph_inc_osd_stopping_blocker() call in the
beginning of the ceph_writepages_start() and
ceph_dec_osd_stopping_blocker() at the end of
the ceph_writepages_start() with the goal to resolve
the racy nature of mdsc->stopping_blockers.
sudo ./check generic/421
FSTYP -- ceph
PLATFORM -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb 3 20:30:08 UTC 2025
MKFS_OPTIONS -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch
generic/421 7s ... 4s
Ran: generic/421
Passed all 1 tests
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-5-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Final responsibility of ceph_writepages_start() is
to submit write requests for processed dirty folios/pages.
The ceph_submit_write() summarize all this logic in
one method.
The generic/421 fails to finish because of the issue:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>
There are two problems here:
if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
rc = -EIO;
goto release_folios;
}
(1) ceph_kill_sb() doesn't wait ending of flushing
all dirty folios/pages because of racy nature of
mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) fails
to increment mdsc->stopping_blockers. Finally,
already locked folios/pages are never been unlocked
and the logic tries to lock the same page second time.
This patch implements refactoring of ceph_submit_write()
and also it solves the second issue.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-4-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
First step of ceph_writepages_start() logic is
of finding the dirty memory folios and processing it.
This patch introduces ceph_process_folio_batch()
method that moves this logic into dedicated method.
The ceph_writepages_start() has this logic:
if (ceph_wbc.locked_pages == 0)
lock_page(page); /* first page */
else if (!trylock_page(page))
break;
<skipped>
if (folio_test_writeback(folio) ||
folio_test_private_2(folio) /* [DEPRECATED] */) {
if (wbc->sync_mode == WB_SYNC_NONE) {
doutc(cl, "%p under writeback\n", folio);
folio_unlock(folio);
continue;
}
doutc(cl, "waiting on writeback %p\n", folio);
folio_wait_writeback(folio);
folio_wait_private_2(folio); /* [DEPRECATED] */
}
The problem here that folio/page is locked here at first
and it is by set_page_writeback(page) later before
submitting the write request. The folio/page is unlocked
by writepages_finish() after finishing the write
request. It means that logic of checking folio_test_writeback()
and folio_wait_writeback() never works because page is locked
and it cannot be locked again until write request completion.
However, for majority of folios/pages the trylock_page()
is used. As a result, multiple threads can try to lock the same
folios/pages multiple times even if they are under writeback
already. It makes this logic more compute intensive than
it is necessary.
This patch changes this logic:
if (folio_test_writeback(folio) ||
folio_test_private_2(folio) /* [DEPRECATED] */) {
if (wbc->sync_mode == WB_SYNC_NONE) {
doutc(cl, "%p under writeback\n", folio);
folio_unlock(folio);
continue;
}
doutc(cl, "waiting on writeback %p\n", folio);
folio_wait_writeback(folio);
folio_wait_private_2(folio); /* [DEPRECATED] */
}
if (ceph_wbc.locked_pages == 0)
lock_page(page); /* first page */
else if (!trylock_page(page))
break;
This logic should exclude the ignoring of writeback
state of folios/pages.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-3-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The ceph_writepages_start() has unreasonably huge size and
complex logic that makes this method hard to understand.
Current state of the method's logic makes bug fix really
hard task. This patch extends the struct ceph_writeback_ctl
with the goal to make ceph_writepages_start() method
more compact and easy to understand by means of deep refactoring.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Link: https://lore.kernel.org/r/20250205000249.123054-2-slava@dubeyko.com
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
ceph already splices the correct dentry (in splice_dentry()) from the
result of mkdir but does nothing more with it.
Now that ->mkdir can return a dentry, return the correct dentry.
Note that previously ceph_mkdir() could call
ceph_init_inode_acls()
on the inode from the wrong dentry, which would be NULL. This
is safe as ceph_init_inode_acls() checks for NULL, but is not
strictly correct. With this patch, the inode for the returned dentry
is passed to ceph_init_inode_acls().
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-4-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
After handling a mkdir, get the inode for the name and use
d_splice_alias() to store the correct dentry in the dcache.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-3-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.14-rc5).
Conflicts:
drivers/net/ethernet/cadence/macb_main.c
fa52f15c745c ("net: cadence: macb: Synchronize stats calculations")
75696dd0fd72 ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au
Adjacent changes:
drivers/net/ethernet/intel/ice/ice_sriov.c
79990cf5e7ad ("ice: Fix deinitializing VF in error path")
a203163274a4 ("ice: simplify VF MSI-X managing")
net/ipv4/tcp.c
18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
297d389e9e5b ("net: prefix devmem specific helpers")
net/mptcp/subflow.c
8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join")
c3349a22c200 ("mptcp: consolidate subflow cleanup")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth.
We didn't get netfilter or wireless PRs this week, so next week's PR
is probably going to be bigger. A healthy dose of fixes for bugs
introduced in the current release nonetheless.
Current release - regressions:
- Bluetooth: always allow SCO packets for user channel
- af_unix: fix memory leak in unix_dgram_sendmsg()
- rxrpc:
- remove redundant peer->mtu_lock causing lockdep splats
- fix spinlock flavor issues with the peer record hash
- eth: iavf: fix circular lock dependency with netdev_lock
- net: use rtnl_net_dev_lock() in
register_netdevice_notifier_dev_net() RDMA driver register notifier
after the device
Current release - new code bugs:
- ethtool: fix ioctl confusing drivers about desired HDS user config
- eth: ixgbe: fix media cage present detection for E610 device
Previous releases - regressions:
- loopback: avoid sending IP packets without an Ethernet header
- mptcp: reset connection when MPTCP opts are dropped after join
Previous releases - always broken:
- net: better track kernel sockets lifetime
- ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels
- phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT
- eth: enetc: number of error handling fixes
- dsa: rtl8366rb: reshuffle the code to fix config / build issue with
LED support"
* tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
net: ti: icss-iep: Reject perout generation request
idpf: fix checksums set in idpf_rx_rsc()
selftests: drv-net: Check if combined-count exists
net: ipv6: fix dst ref loop on input in rpl lwt
net: ipv6: fix dst ref loop on input in seg6 lwt
usbnet: gl620a: fix endpoint checking in genelink_bind()
net/mlx5: IRQ, Fix null string in debug print
net/mlx5: Restore missing trace event when enabling vport QoS
net/mlx5: Fix vport QoS cleanup on error
net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination.
af_unix: Fix memory leak in unix_dgram_sendmsg()
net: Handle napi_schedule() calls from non-interrupt
net: Clear old fragment checksum value in napi_reuse_skb
gve: unlink old napi when stopping a queue using queue API
net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().
tcp: Defer ts_recent changes until req is owned
net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs()
net: enetc: remove the mm_lock from the ENETC v4 driver
net: enetc: add missing enetc4_link_deinit()
net: enetc: update UDP checksum when updating originTimestamp field
...
|