Age | Commit message (Collapse) | Author |
|
__lookup_nat_cache follows LRU manner to move clean nat entry, when nat
entries are going to be dirty, no need to move them to tail of lru list.
Introduce a parameter 'for_dirty' to avoid it.
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dentry d_flags updates from Al Viro:
"The current exclusion rules for dentry->d_flags stores are rather
unpleasant. The basic rules are simple:
- stores to dentry->d_flags are OK under dentry->d_lock
- stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads
Unfortunately, there's a couple of exceptions to that, and that's
where the headache comes from.
The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we might
as well precalculate the initial value of 'd_flags' when we set the
default ->d_op for given superblock and set 'd_flags' directly instead
of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not
going to reproduce it here. See [1] for gory details, if you care. The
critical part is using d_set_d_op() only just prior to
d_splice_alias(), which makes a combination of d_splice_alias() with
setting ->d_op, etc a natural replacement primitive.
Better yet, if we go that way, it's easy to take setting ->d_op and
modifying 'd_flags' under ->d_lock, which eliminates the headache as
far as 'd_flags' exclusion rules are concerned. Other exceptions are
minor and easy to deal with.
What this series does:
- d_set_d_op() is no longer available; instead a new primitive
(d_splice_alias_ops()) is provided, equivalent to combination of
d_set_d_op() and d_splice_alias().
- new field of struct super_block - 's_d_flags'. This sets the
default value of 'd_flags' to be used when allocating dentries on
this filesystem.
- new primitive for setting 's_d_op': set_default_d_op(). This
replaces stores to 's_d_op' at mount time.
All in-tree filesystems converted; out-of-tree ones will get caught
by the compiler ('s_d_op' is renamed, so stores to it will be
caught). 's_d_flags' is set by the same primitive to match the
's_d_op'.
- a lot of filesystems had sb->s_d_op->d_delete equal to
always_delete_dentry; that is equivalent to setting
DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
set that bit in 's_d_flags' and drop 'd_delete()' from
dentry_operations.
In quite a few cases that results in empty dentry_operations, which
means that we can get rid of those.
- kill simple_dentry_operations - not needed anymore
- massage d_alloc_parallel() to get rid of the other exception wrt
'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
allocate the new dentry; no need to delay that until we commit to
using the sucker.
As the result, 'd_flags' stores are all either under ->d_lock or done
before the dentry becomes visible in any shared data structures"
Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
configfs: use DCACHE_DONTCACHE
debugfs: use DCACHE_DONTCACHE
efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
9p: don't bother with always_delete_dentry
ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
kill simple_dentry_operations
devpts, sunrpc, hostfs: don't bother with ->d_op
shmem: no dentry retention past the refcount reaching zero
d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
make d_set_d_op() static
simple_lookup(): just set DCACHE_DONTCACHE
tracefs: Add d_delete to remove negative dentries
set_default_d_op(): calculate the matching value for ->d_flags
correct the set of flags forbidden at d_set_d_op() time
split d_flags calculation out of d_set_d_op()
new helper: set_default_d_op()
fuse: no need for special dentry_operations for root dentry
switch procfs from d_set_d_op() to d_splice_alias_ops()
new helper: d_splice_alias_ops()
procfs: kill ->proc_dops
...
|
|
The most unlikely watched permission event is FAN_ACCESS_PERM, because
at the time that it was introduced there were no evictable ignore mark,
so subscribing to FAN_ACCESS_PERM would have incured a very high
overhead.
Yet, when we set the fmode to FMODE_NOTIFY_HSM(), we never skip trying
to send FAN_ACCESS_PERM, which is almost always a waste of cycles.
We got to this logic because of bundling FAN_OPEN*_PERM and
FAN_ACCESS_PERM in the same category and because FAN_OPEN_PERM is a
commonly used event.
By open coding fsnotify_open_perm() in fsnotify_open_perm_and_set_mode(),
we no longer need to regard FAN_OPEN*_PERM when calculating fmode.
This leaves the case of having pre-content events and not having any
other permission event in the object masks a more likely case than the
other way around.
Rework the fmode macros and code so that their meaning now refers only
to hooks on an already open file:
- FMODE_NOTIFY_NONE() skip all events
- FMODE_NOTIFY_ACCESS_PERM() send all permission events including
FAN_ACCESS_PERM
- FMODE_NOTIFY_HSM() send pre-content permission events
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250708143641.418603-3-amir73il@gmail.com
|
|
Create helper fsnotify_open_perm_and_set_mode() that moves the
fsnotify_open_perm() hook into file_set_fsnotify_mode_from_watchers().
This will allow some more optimizations.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250708143641.418603-2-amir73il@gmail.com
|
|
Pull nfsd updates from Chuck Lever:
"NFSD is finally able to offer write delegations to clients that open
files with O_WRONLY, thanks to patches from Dai Ngo. We're expecting
this to accelerate a few interesting corner cases.
The cap on the number of operations per NFSv4 COMPOUND has been
lifted. Now, clients that send COMPOUNDs containing dozens of
operations (for example, a long stream of LOOKUP operations to walk a
pathname in a single round trip) will no longer be rejected.
This release re-enables the ability for NFSD to perform NFSv4.2 COPY
operations asynchronously. This feature has been disabled to mitigate
the risk of denial-of-service when too many such requests arrive.
Many thanks to the contributors, reviewers, testers, and bug reporters
who participated during the v6.17 development cycle"
* tag 'nfsd-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (32 commits)
nfsd: Drop dprintk in blocklayout xdr functions
sunrpc: make svc_tcp_sendmsg() take a signed sentp pointer
sunrpc: rearrange struct svc_rqst for fewer cachelines
sunrpc: return better error in svcauth_gss_accept() on alloc failure
sunrpc: reset rq_accept_statp when starting a new RPC
sunrpc: remove SVC_SYSERR
sunrpc: fix handling of unknown auth status codes
NFSD: Simplify struct knfsd_fh
NFSD: Access a knfsd_fh's fsid by pointer
Revert "NFSD: Force all NFSv4.2 COPY requests to be synchronous"
NFSD: Avoid multiple -Wflex-array-member-not-at-end warnings
NFSD: Use vfs_iocb_iter_write()
NFSD: Use vfs_iocb_iter_read()
NFSD: Clean up kdoc for nfsd_open_local_fh()
NFSD: Clean up kdoc for nfsd_file_put_local()
NFSD: Remove definition for trace_nfsd_ctl_maxconn
NFSD: Remove definition for trace_nfsd_file_gc_recent
NFSD: Remove definitions for unused trace_nfsd_file_lru trace points
NFSD: Remove definition for trace_nfsd_file_unhash_and_queue
nfsd: Use correct error code when decoding extents
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Prevent cluster nodes from trying to recover their own filesystems
during a withdraw
- Add two missing migrate_folio aops and an additional exhash directory
consistency check (both triggered by syzbot bug reports)
- Sanitize how dlm results are processed and clean up a few quirks in
the glock code
- Minor stuff: Get rid of the GIF_ALLOC_FAILED flag; use SECTOR_SIZE
and SECTOR_SHIFT
* tag 'gfs2-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: No more self recovery
gfs2: Validate i_depth for exhash directories
gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aops
gfs2: a minor finish_xmote cleanup
gfs2: simplify finish_xmote
gfs2: sanitize the gdlm_ast -> finish_xmote interface
gfs2: Minor do_xmote cancelation fix
gfs2: Remove GIF_ALLOC_FAILED flag
gfs2: Use SECTOR_SIZE and SECTOR_SHIFT
|
|
Pull xfs updates from Carlos Maiolino:
"This doesn't contain any new features. It mostly is a collection of
clean ups and code refactoring that I preferred to postpone to the
merge window.
It includes removal of several unused tracepoints, refactoring key
comparing routines under the B-Trees management and cleanup of xfs
journaling code"
* tag 'xfs-merge-6.17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
xfs: don't use a xfs_log_iovec for ri_buf in log recovery
xfs: don't use a xfs_log_iovec for attr_item names and values
xfs: use better names for size members in xfs_log_vec
xfs: cleanup the ordered item logic in xlog_cil_insert_format_items
xfs: don't pass the old lv to xfs_cil_prepare_item
xfs: remove unused trace event xfs_reflink_cow_enospc
xfs: remove unused trace event xfs_discard_rtrelax
xfs: remove unused trace event xfs_log_cil_return
xfs: remove unused trace event xfs_dqreclaim_dirty
fs/xfs: replace strncpy with memtostr_pad()
xfs: Remove unused label in xfs_dax_notify_dev_failure
xfs: improve the comments in xfs_select_zone_nowait
xfs: improve the comments in xfs_max_open_zones
xfs: stop passing an inode to the zone space reservation helpers
xfs: rename oz_write_pointer to oz_allocated
xfs: use a uint32_t to cache i_used_blocks in xfs_init_zone
xfs: improve the xg_active_ref check in xfs_group_free
xfs: remove the xlog_ticket_t typedef
xfs: remove xrep_trans_{alloc,cancel}_hook_dummy
xfs: return the allocated transaction from xchk_trans_alloc_empty
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"We now support metadata compression. It can be useful for embedded use
cases or archiving a large number of small files.
Additionally, readdir performance has been improved by enabling
readahead (note that it was already common practice for ext3/4 non-dx
and f2fs directories). We may consider further improvements later to
align with ext4's s_inode_readahead_blks behavior for slow devices
too.
The remaining commits are minor.
Summary:
- Add support for metadata compression
- Enable readahead for directories to improve readdir performance
- Minor fixes and cleanups"
* tag 'erofs-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support to readahead dirent blocks in erofs_readdir()
erofs: implement metadata compression
erofs: add on-disk definition for metadata compression
erofs: fix build error with CONFIG_EROFS_FS_ZIP_ACCEL=y
erofs: remove ENOATTR definition
erofs: refine erofs_iomap_begin()
erofs: unify meta buffers in z_erofs_fill_inode()
erofs: remove need_kmap in erofs_read_metabuf()
erofs: do sanity check on m->type in z_erofs_load_compact_lcluster()
erofs: get rid of {get,put}_page() for ztailpacking data
|
|
https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"Added:
- sanity check for file name
- mark live inode as bad and avoid any operations
Fixed:
- handling of symlinks created in windows
- creation of symlinks for relative path
Changed:
- cancel setting inode as bad after removing name fails
- revert 'replace inode_trylock with inode_lock'"
* tag 'ntfs3_for_6.17' of https://github.com/Paragon-Software-Group/linux-ntfs3:
Revert "fs/ntfs3: Replace inode_trylock with inode_lock"
fs/ntfs3: Exclude call make_bad_inode for live nodes.
fs/ntfs3: cancle set bad inode after removing name fails
fs/ntfs3: Add sanity check for file name
fs/ntfs3: correctly create symlink for relative path
fs/ntfs3: fix symlinks cannot be handled correctly
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"A number of usability and feature updates, scattered performance
improvements and fixes. Highlight of the core changes is getting
closer to enabling large folios (now behind a config option).
User visible changes:
- update defrag ioctl, add new flag to request no compression on
existing extents
- restrict writes to block devices after mount
- in experimental config, enable large folios for data, almost
complete but not widely tested
- add stats tracking duration of critical section in transaction
commit to /sys/fs/btrfs/FSID/commit_stats
Performance improvements:
- caching of lookup results of free space bitmap (20% runtime
improvement on an empty file creation benchmark)
- accessors to metadata (b-tree items) simplified and optimized,
minor improvement in metadata-heavy workloads
- readahead on compressed data improves sequential read
- the xarray for extent buffers is indexed by denser keys, leading to
better packing of the nodes (50-70% reduction of leaf nodes)
Notable fixes:
- stricter compression mount option parsing
- send properly emits fallocate command for file holes when protocol
v2 is used
- fix overallocation of chunks with mount option 'ssd_spread', due to
interaction with size classes not finding the right chunk
(workaround: manual reclaim by 'usage' balance filter)
- various quota enable/disable races with rescan, more verbose
notifications about inconsistent state
- populate otime in tree-log during log replay
- handle ENOSPC when NOCOW file is used with mmap()
Core:
- large data folios enabled in experimental config
- improved error handling, transaction abort call sites
- in zoned mode, allocate reloc block group on mount to make sure
there's always one available for zone reclaim under heavy load
- rework device opening, they're always open as read-only and delayed
until the super block is created, allowing the restricted writes
after mount
- preparatory work for adding blk_holder_ops, allowing device
freeze/thaw in the future
Cleanups, refactoring:
- type and naming unifications (int/bool, return variables)
- rb-tree helper refactoring and simplifications
- reorder memory allocations to less critical places
- RCU string (used for device name) refactoring and API removal
- replace all remaining use of strcpy()"
* tag 'for-6.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (209 commits)
btrfs: send: use fallocate for hole punching with send stream v2
btrfs: unfold transaction aborts when writing dirty block groups
btrfs: use saner variable type and name to indicate extrefs at add_inode_ref()
btrfs: don't skip remaining extrefs if dir not found during log replay
btrfs: don't ignore inode missing when replaying log tree
btrfs: enable large data folios for data reloc inode
btrfs: output more info when btrfs_subpage_assert() failed
btrfs: reloc: unconditionally invalidate the page cache for each cluster
btrfs: defrag: add flag to force no-compression
btrfs: fix ssd_spread overallocation
btrfs: zoned: requeue to unused block group list if zone finish failed
btrfs: zoned: do not remove unwritten non-data block group
btrfs: remove btrfs_clear_extent_bits()
btrfs: use cached state when falling back from NOCoW write to CoW write
btrfs: set EXTENT_NORESERVE before range unlock in btrfs_truncate_block()
btrfs: don't print relocation messages from auto reclaim
btrfs: remove redundant auto reclaim log message
btrfs: make btrfs_check_nocow_lock() check more than one extent
btrfs: assert we can NOCOW the range in btrfs_truncate_block()
btrfs: update function comment for btrfs_check_nocow_lock()
...
|
|
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
SMB1 already supports querying reparse points and detecting types of
symlink, fifo, socket, block and char.
This change implements the missing part - ability to create a new reparse
points over SMB1. This includes everything which SMB2+ already supports:
- native SMB symlinks and sockets
- NFS style of special files (symlinks, fifos, sockets, char/block devs)
- WSL style of special files (symlinks, fifos, sockets, char/block devs)
Attaching a reparse point to an existing file or directory is done via
SMB1 SMB_COM_NT_TRANSACT/NT_TRANSACT_IOCTL/FSCTL_SET_REPARSE_POINT command
and implemented in a new cifs_create_reparse_inode() function.
This change introduce a new callback ->create_reparse_inode() which creates
a new reperse point file or directory and returns inode. For SMB1 it is
provided via that new cifs_create_reparse_inode() function.
Existing reparse.c code was only slightly updated to call new protocol
callback ->create_reparse_inode() instead of hardcoded SMB2+ function.
This make the whole reparse.c code to work with every SMB dialect.
The original callback ->create_reparse_symlink() is not needed anymore as
the implementation of new create_reparse_symlink() function is dialect
agnostic too. So the link.c code was updated to call that function directly
(and not via callback).
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
WSL EAs are not required for native SMB symlinks, so do not query them from server.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When not searching for child entries with msearch wildcard pattern then ask
server just for one output entry. There is no need to ask for more entries
as we are interested only for one search result, as we are doing query on
path.
CIFSFindFirst() with msearch=false is called by the cifs_query_path_info()
function.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
To query root path (without msearch wildcard) it is needed to
send pattern '\' instead of '' (empty string).
This allows to use CIFSFindFirst() to query information about root path
which is being used in followup changes.
This change fixes the stat() syscall called on the root path on the mount.
It is because stat() syscall uses the cifs_query_path_info() function and
it can fallback to the CIFSFindFirst() usage with msearch=false.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Some servers might enforce the SPN to be set in the target info
blob (AV pairs) when sending NTLMSSP_AUTH message. In Windows Server,
this could be enforced with SmbServerNameHardeningLevel set to 2.
Fix this by always appending SPN (cifs/<hostname>) to the existing
list of target infos when setting up NTLMv2 response blob.
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Zero-length AV pairs should be considered as valid target infos.
Don't skip the next AV pairs that follow them.
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Fixes: 0e8ae9b953bc ("smb: client: parse av pair type 4 in CHALLENGE_MESSAGE")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The handlecache code today tracks the time at which dir lease was
acquired and the laundromat thread uses that to check for old
entries to cleanup.
However, if a directory is actively accessed, it should not
be chosen to expire first.
This change adds a new last_access_time field to cfid and
uses that to decide expiry of the cfid.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cached_dir_lease_break() has return type as int but only
returning true or false. change return type of this function
to bool for clarity.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We now do a weighted selection of server interfaces when allocating
new channels. The weights are decided based on the speed advertised.
The fulfilled weight for an interface is a counter that is used to
track the interface selection. It should be reset back to zero once
all interfaces fulfilling their weight.
In cifs_chan_update_iface, this reset logic was missing. As a result
when the server interface list changes, the client may not be able
to find a new candidate for other channels after all interfaces have
been fulfilled.
Fixes: a6d8fb54a515 ("cifs: distribute channels across interfaces based on speed")
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
After commit 5c70eb5c593d ("net: better track kernel sockets lifetime"),
kernel sockets now use net_passive reference counting. However, commit
95d2b9f693ff ("Revert "smb: client: fix TCP timers deadlock after rmmod"")
restored the manual socket refcount manipulation without adapting to this
new mechanism, causing a memory leak.
The issue can be reproduced by[1]:
1. Creating a network namespace
2. Mounting and Unmounting CIFS within the namespace
3. Deleting the namespace
Some memory leaks may appear after a period of time following step 3.
unreferenced object 0xffff9951419f6b00 (size 256):
comm "ip", pid 447, jiffies 4294692389 (age 14.730s)
hex dump (first 32 bytes):
1b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 80 77 c2 44 51 99 ff ff .........w.DQ...
backtrace:
__kmem_cache_alloc_node+0x30e/0x3d0
__kmalloc+0x52/0x120
net_alloc_generic+0x1d/0x30
copy_net_ns+0x86/0x200
create_new_namespaces+0x117/0x300
unshare_nsproxy_namespaces+0x60/0xa0
ksys_unshare+0x148/0x360
__x64_sys_unshare+0x12/0x20
do_syscall_64+0x59/0x110
entry_SYSCALL_64_after_hwframe+0x78/0xe2
...
unreferenced object 0xffff9951442e7500 (size 32):
comm "mount.cifs", pid 475, jiffies 4294693782 (age 13.343s)
hex dump (first 32 bytes):
40 c5 38 46 51 99 ff ff 18 01 96 42 51 99 ff ff @.8FQ......BQ...
01 00 00 00 6f 00 c5 07 6f 00 d8 07 00 00 00 00 ....o...o.......
backtrace:
__kmem_cache_alloc_node+0x30e/0x3d0
kmalloc_trace+0x2a/0x90
ref_tracker_alloc+0x8e/0x1d0
sk_alloc+0x18c/0x1c0
inet_create+0xf1/0x370
__sock_create+0xd7/0x1e0
generic_ip_connect+0x1d4/0x5a0 [cifs]
cifs_get_tcp_session+0x5d0/0x8a0 [cifs]
cifs_mount_get_session+0x47/0x1b0 [cifs]
dfs_mount_share+0xfa/0xa10 [cifs]
cifs_mount+0x68/0x2b0 [cifs]
cifs_smb3_do_mount+0x10b/0x760 [cifs]
smb3_get_tree+0x112/0x2e0 [cifs]
vfs_get_tree+0x29/0xf0
path_mount+0x2d4/0xa00
__se_sys_mount+0x165/0x1d0
Root cause:
When creating kernel sockets, sk_alloc() calls net_passive_inc() for
sockets with sk_net_refcnt=0. The CIFS code manually converts kernel
sockets to user sockets by setting sk_net_refcnt=1, but doesn't call
the corresponding net_passive_dec(). This creates an imbalance in the
net_passive counter, which prevents the network namespace from being
destroyed when its last user reference is dropped. As a result, the
entire namespace and all its associated resources remain allocated.
Timeline of patches leading to this issue:
- commit ef7134c7fc48 ("smb: client: Fix use-after-free of network
namespace.") in v6.12 fixed the original netns UAF by manually
managing socket refcounts
- commit e9f2517a3e18 ("smb: client: fix TCP timers deadlock after
rmmod") in v6.13 attempted to use kernel sockets but introduced
TCP timer issues
- commit 5c70eb5c593d ("net: better track kernel sockets lifetime")
in v6.14-rc5 introduced the net_passive mechanism with
sk_net_refcnt_upgrade() for proper socket conversion
- commit 95d2b9f693ff ("Revert "smb: client: fix TCP timers deadlock
after rmmod"") in v6.15-rc3 reverted to manual refcount management
without adapting to the new net_passive changes
Fix this by using sk_net_refcnt_upgrade() which properly handles the
net_passive counter when converting kernel sockets to user sockets.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220343 [1]
Fixes: 95d2b9f693ff ("Revert "smb: client: fix TCP timers deadlock after rmmod"")
Cc: stable@vger.kernel.org
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The hfs_find_init() method can trigger the crash
if tree pointer is NULL:
[ 45.746290][ T9787] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000008: 0000 [#1] SMP KAI
[ 45.747287][ T9787] KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
[ 45.748716][ T9787] CPU: 2 UID: 0 PID: 9787 Comm: repro Not tainted 6.16.0-rc3 #10 PREEMPT(full)
[ 45.750250][ T9787] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 45.751983][ T9787] RIP: 0010:hfs_find_init+0x86/0x230
[ 45.752834][ T9787] Code: c1 ea 03 80 3c 02 00 0f 85 9a 01 00 00 4c 8d 6b 40 48 c7 45 18 00 00 00 00 48 b8 00 00 00 00 00 fc
[ 45.755574][ T9787] RSP: 0018:ffffc90015157668 EFLAGS: 00010202
[ 45.756432][ T9787] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff819a4d09
[ 45.757457][ T9787] RDX: 0000000000000008 RSI: ffffffff819acd3a RDI: ffffc900151576e8
[ 45.758282][ T9787] RBP: ffffc900151576d0 R08: 0000000000000005 R09: 0000000000000000
[ 45.758943][ T9787] R10: 0000000080000000 R11: 0000000000000001 R12: 0000000000000004
[ 45.759619][ T9787] R13: 0000000000000040 R14: ffff88802c50814a R15: 0000000000000000
[ 45.760293][ T9787] FS: 00007ffb72734540(0000) GS:ffff8880cec64000(0000) knlGS:0000000000000000
[ 45.761050][ T9787] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.761606][ T9787] CR2: 00007f9bd8225000 CR3: 000000010979a000 CR4: 00000000000006f0
[ 45.762286][ T9787] Call Trace:
[ 45.762570][ T9787] <TASK>
[ 45.762824][ T9787] hfs_ext_read_extent+0x190/0x9d0
[ 45.763269][ T9787] ? submit_bio_noacct_nocheck+0x2dd/0xce0
[ 45.763766][ T9787] ? __pfx_hfs_ext_read_extent+0x10/0x10
[ 45.764250][ T9787] hfs_get_block+0x55f/0x830
[ 45.764646][ T9787] block_read_full_folio+0x36d/0x850
[ 45.765105][ T9787] ? __pfx_hfs_get_block+0x10/0x10
[ 45.765541][ T9787] ? const_folio_flags+0x5b/0x100
[ 45.765972][ T9787] ? __pfx_hfs_read_folio+0x10/0x10
[ 45.766415][ T9787] filemap_read_folio+0xbe/0x290
[ 45.766840][ T9787] ? __pfx_filemap_read_folio+0x10/0x10
[ 45.767325][ T9787] ? __filemap_get_folio+0x32b/0xbf0
[ 45.767780][ T9787] do_read_cache_folio+0x263/0x5c0
[ 45.768223][ T9787] ? __pfx_hfs_read_folio+0x10/0x10
[ 45.768666][ T9787] read_cache_page+0x5b/0x160
[ 45.769070][ T9787] hfs_btree_open+0x491/0x1740
[ 45.769481][ T9787] hfs_mdb_get+0x15e2/0x1fb0
[ 45.769877][ T9787] ? __pfx_hfs_mdb_get+0x10/0x10
[ 45.770316][ T9787] ? find_held_lock+0x2b/0x80
[ 45.770731][ T9787] ? lockdep_init_map_type+0x5c/0x280
[ 45.771200][ T9787] ? lockdep_init_map_type+0x5c/0x280
[ 45.771674][ T9787] hfs_fill_super+0x38e/0x720
[ 45.772092][ T9787] ? __pfx_hfs_fill_super+0x10/0x10
[ 45.772549][ T9787] ? snprintf+0xbe/0x100
[ 45.772931][ T9787] ? __pfx_snprintf+0x10/0x10
[ 45.773350][ T9787] ? do_raw_spin_lock+0x129/0x2b0
[ 45.773796][ T9787] ? find_held_lock+0x2b/0x80
[ 45.774215][ T9787] ? set_blocksize+0x40a/0x510
[ 45.774636][ T9787] ? sb_set_blocksize+0x176/0x1d0
[ 45.775087][ T9787] ? setup_bdev_super+0x369/0x730
[ 45.775533][ T9787] get_tree_bdev_flags+0x384/0x620
[ 45.775985][ T9787] ? __pfx_hfs_fill_super+0x10/0x10
[ 45.776453][ T9787] ? __pfx_get_tree_bdev_flags+0x10/0x10
[ 45.776950][ T9787] ? bpf_lsm_capable+0x9/0x10
[ 45.777365][ T9787] ? security_capable+0x80/0x260
[ 45.777803][ T9787] vfs_get_tree+0x8e/0x340
[ 45.778203][ T9787] path_mount+0x13de/0x2010
[ 45.778604][ T9787] ? kmem_cache_free+0x2b0/0x4c0
[ 45.779052][ T9787] ? __pfx_path_mount+0x10/0x10
[ 45.779480][ T9787] ? getname_flags.part.0+0x1c5/0x550
[ 45.779954][ T9787] ? putname+0x154/0x1a0
[ 45.780335][ T9787] __x64_sys_mount+0x27b/0x300
[ 45.780758][ T9787] ? __pfx___x64_sys_mount+0x10/0x10
[ 45.781232][ T9787] do_syscall_64+0xc9/0x480
[ 45.781631][ T9787] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 45.782149][ T9787] RIP: 0033:0x7ffb7265b6ca
[ 45.782539][ T9787] Code: 48 8b 0d c9 17 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48
[ 45.784212][ T9787] RSP: 002b:00007ffc0c10cfb8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 45.784935][ T9787] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffb7265b6ca
[ 45.785626][ T9787] RDX: 0000200000000240 RSI: 0000200000000280 RDI: 00007ffc0c10d100
[ 45.786316][ T9787] RBP: 00007ffc0c10d190 R08: 00007ffc0c10d000 R09: 0000000000000000
[ 45.787011][ T9787] R10: 0000000000000048 R11: 0000000000000206 R12: 0000560246733250
[ 45.787697][ T9787] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 45.788393][ T9787] </TASK>
[ 45.788665][ T9787] Modules linked in:
[ 45.789058][ T9787] ---[ end trace 0000000000000000 ]---
[ 45.789554][ T9787] RIP: 0010:hfs_find_init+0x86/0x230
[ 45.790028][ T9787] Code: c1 ea 03 80 3c 02 00 0f 85 9a 01 00 00 4c 8d 6b 40 48 c7 45 18 00 00 00 00 48 b8 00 00 00 00 00 fc
[ 45.792364][ T9787] RSP: 0018:ffffc90015157668 EFLAGS: 00010202
[ 45.793155][ T9787] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff819a4d09
[ 45.794123][ T9787] RDX: 0000000000000008 RSI: ffffffff819acd3a RDI: ffffc900151576e8
[ 45.795105][ T9787] RBP: ffffc900151576d0 R08: 0000000000000005 R09: 0000000000000000
[ 45.796135][ T9787] R10: 0000000080000000 R11: 0000000000000001 R12: 0000000000000004
[ 45.797114][ T9787] R13: 0000000000000040 R14: ffff88802c50814a R15: 0000000000000000
[ 45.798024][ T9787] FS: 00007ffb72734540(0000) GS:ffff8880cec64000(0000) knlGS:0000000000000000
[ 45.799019][ T9787] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.799822][ T9787] CR2: 00007f9bd8225000 CR3: 000000010979a000 CR4: 00000000000006f0
[ 45.800747][ T9787] Kernel panic - not syncing: Fatal exception
The hfs_fill_super() calls hfs_mdb_get() method that tries
to construct Extents Tree and Catalog Tree:
HFS_SB(sb)->ext_tree = hfs_btree_open(sb, HFS_EXT_CNID, hfs_ext_keycmp);
if (!HFS_SB(sb)->ext_tree) {
pr_err("unable to open extent tree\n");
goto out;
}
HFS_SB(sb)->cat_tree = hfs_btree_open(sb, HFS_CAT_CNID, hfs_cat_keycmp);
if (!HFS_SB(sb)->cat_tree) {
pr_err("unable to open catalog tree\n");
goto out;
}
However, hfs_btree_open() calls read_mapping_page() that
calls hfs_get_block(). And this method calls hfs_ext_read_extent():
static int hfs_ext_read_extent(struct inode *inode, u16 block)
{
struct hfs_find_data fd;
int res;
if (block >= HFS_I(inode)->cached_start &&
block < HFS_I(inode)->cached_start + HFS_I(inode)->cached_blocks)
return 0;
res = hfs_find_init(HFS_SB(inode->i_sb)->ext_tree, &fd);
if (!res) {
res = __hfs_ext_cache_extent(&fd, inode, block);
hfs_find_exit(&fd);
}
return res;
}
The problem here that hfs_find_init() is trying to use
HFS_SB(inode->i_sb)->ext_tree that is not initialized yet.
It will be initailized when hfs_btree_open() finishes
the execution.
The patch adds checking of tree pointer in hfs_find_init()
and it reworks the logic of hfs_btree_open() by reading
the b-tree's header directly from the volume. The read_mapping_page()
is exchanged on filemap_grab_folio() that grab the folio from
mapping. Then, sb_bread() extracts the b-tree's header
content and copy it into the folio.
Reported-by: Wenzhi Wang <wenzhi.wang@uwaterloo.ca>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20250710213657.108285-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
This patch introduces is_bnode_offset_valid() method that checks
the requested offset value. Also, it introduces
check_and_correct_requested_length() method that checks and
correct the requested length (if it is necessary). These methods
are used in hfs_bnode_read(), hfs_bnode_write(), hfs_bnode_clear(),
hfs_bnode_copy(), and hfs_bnode_move() with the goal to prevent
the access out of allocated memory and triggering the crash.
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250703214912.244138-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
The hfsplus_bnode_read() method can trigger the issue:
[ 174.852007][ T9784] ==================================================================
[ 174.852709][ T9784] BUG: KASAN: slab-out-of-bounds in hfsplus_bnode_read+0x2f4/0x360
[ 174.853412][ T9784] Read of size 8 at addr ffff88810b5fc6c0 by task repro/9784
[ 174.854059][ T9784]
[ 174.854272][ T9784] CPU: 1 UID: 0 PID: 9784 Comm: repro Not tainted 6.16.0-rc3 #7 PREEMPT(full)
[ 174.854281][ T9784] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 174.854286][ T9784] Call Trace:
[ 174.854289][ T9784] <TASK>
[ 174.854292][ T9784] dump_stack_lvl+0x10e/0x1f0
[ 174.854305][ T9784] print_report+0xd0/0x660
[ 174.854315][ T9784] ? __virt_addr_valid+0x81/0x610
[ 174.854323][ T9784] ? __phys_addr+0xe8/0x180
[ 174.854330][ T9784] ? hfsplus_bnode_read+0x2f4/0x360
[ 174.854337][ T9784] kasan_report+0xc6/0x100
[ 174.854346][ T9784] ? hfsplus_bnode_read+0x2f4/0x360
[ 174.854354][ T9784] hfsplus_bnode_read+0x2f4/0x360
[ 174.854362][ T9784] hfsplus_bnode_dump+0x2ec/0x380
[ 174.854370][ T9784] ? __pfx_hfsplus_bnode_dump+0x10/0x10
[ 174.854377][ T9784] ? hfsplus_bnode_write_u16+0x83/0xb0
[ 174.854385][ T9784] ? srcu_gp_start+0xd0/0x310
[ 174.854393][ T9784] ? __mark_inode_dirty+0x29e/0xe40
[ 174.854402][ T9784] hfsplus_brec_remove+0x3d2/0x4e0
[ 174.854411][ T9784] __hfsplus_delete_attr+0x290/0x3a0
[ 174.854419][ T9784] ? __pfx_hfs_find_1st_rec_by_cnid+0x10/0x10
[ 174.854427][ T9784] ? __pfx___hfsplus_delete_attr+0x10/0x10
[ 174.854436][ T9784] ? __asan_memset+0x23/0x50
[ 174.854450][ T9784] hfsplus_delete_all_attrs+0x262/0x320
[ 174.854459][ T9784] ? __pfx_hfsplus_delete_all_attrs+0x10/0x10
[ 174.854469][ T9784] ? rcu_is_watching+0x12/0xc0
[ 174.854476][ T9784] ? __mark_inode_dirty+0x29e/0xe40
[ 174.854483][ T9784] hfsplus_delete_cat+0x845/0xde0
[ 174.854493][ T9784] ? __pfx_hfsplus_delete_cat+0x10/0x10
[ 174.854507][ T9784] hfsplus_unlink+0x1ca/0x7c0
[ 174.854516][ T9784] ? __pfx_hfsplus_unlink+0x10/0x10
[ 174.854525][ T9784] ? down_write+0x148/0x200
[ 174.854532][ T9784] ? __pfx_down_write+0x10/0x10
[ 174.854540][ T9784] vfs_unlink+0x2fe/0x9b0
[ 174.854549][ T9784] do_unlinkat+0x490/0x670
[ 174.854557][ T9784] ? __pfx_do_unlinkat+0x10/0x10
[ 174.854565][ T9784] ? __might_fault+0xbc/0x130
[ 174.854576][ T9784] ? getname_flags.part.0+0x1c5/0x550
[ 174.854584][ T9784] __x64_sys_unlink+0xc5/0x110
[ 174.854592][ T9784] do_syscall_64+0xc9/0x480
[ 174.854600][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 174.854608][ T9784] RIP: 0033:0x7f6fdf4c3167
[ 174.854614][ T9784] Code: f0 ff ff 73 01 c3 48 8b 0d 26 0d 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 08
[ 174.854622][ T9784] RSP: 002b:00007ffcb948bca8 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[ 174.854630][ T9784] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6fdf4c3167
[ 174.854636][ T9784] RDX: 00007ffcb948bcc0 RSI: 00007ffcb948bcc0 RDI: 00007ffcb948bd50
[ 174.854641][ T9784] RBP: 00007ffcb948cd90 R08: 0000000000000001 R09: 00007ffcb948bb40
[ 174.854645][ T9784] R10: 00007f6fdf564fc0 R11: 0000000000000206 R12: 0000561e1bc9c2d0
[ 174.854650][ T9784] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 174.854658][ T9784] </TASK>
[ 174.854661][ T9784]
[ 174.879281][ T9784] Allocated by task 9784:
[ 174.879664][ T9784] kasan_save_stack+0x20/0x40
[ 174.880082][ T9784] kasan_save_track+0x14/0x30
[ 174.880500][ T9784] __kasan_kmalloc+0xaa/0xb0
[ 174.880908][ T9784] __kmalloc_noprof+0x205/0x550
[ 174.881337][ T9784] __hfs_bnode_create+0x107/0x890
[ 174.881779][ T9784] hfsplus_bnode_find+0x2d0/0xd10
[ 174.882222][ T9784] hfsplus_brec_find+0x2b0/0x520
[ 174.882659][ T9784] hfsplus_delete_all_attrs+0x23b/0x320
[ 174.883144][ T9784] hfsplus_delete_cat+0x845/0xde0
[ 174.883595][ T9784] hfsplus_rmdir+0x106/0x1b0
[ 174.884004][ T9784] vfs_rmdir+0x206/0x690
[ 174.884379][ T9784] do_rmdir+0x2b7/0x390
[ 174.884751][ T9784] __x64_sys_rmdir+0xc5/0x110
[ 174.885167][ T9784] do_syscall_64+0xc9/0x480
[ 174.885568][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 174.886083][ T9784]
[ 174.886293][ T9784] The buggy address belongs to the object at ffff88810b5fc600
[ 174.886293][ T9784] which belongs to the cache kmalloc-192 of size 192
[ 174.887507][ T9784] The buggy address is located 40 bytes to the right of
[ 174.887507][ T9784] allocated 152-byte region [ffff88810b5fc600, ffff88810b5fc698)
[ 174.888766][ T9784]
[ 174.888976][ T9784] The buggy address belongs to the physical page:
[ 174.889533][ T9784] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10b5fc
[ 174.890295][ T9784] flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
[ 174.890927][ T9784] page_type: f5(slab)
[ 174.891284][ T9784] raw: 057ff00000000000 ffff88801b4423c0 ffffea000426dc80 dead000000000002
[ 174.892032][ T9784] raw: 0000000000000000 0000000080100010 00000000f5000000 0000000000000000
[ 174.892774][ T9784] page dumped because: kasan: bad access detected
[ 174.893327][ T9784] page_owner tracks the page as allocated
[ 174.893825][ T9784] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c00(GFP_NOIO|__GFP_NOWARN|__GFP_NO1
[ 174.895373][ T9784] post_alloc_hook+0x1c0/0x230
[ 174.895801][ T9784] get_page_from_freelist+0xdeb/0x3b30
[ 174.896284][ T9784] __alloc_frozen_pages_noprof+0x25c/0x2460
[ 174.896810][ T9784] alloc_pages_mpol+0x1fb/0x550
[ 174.897242][ T9784] new_slab+0x23b/0x340
[ 174.897614][ T9784] ___slab_alloc+0xd81/0x1960
[ 174.898028][ T9784] __slab_alloc.isra.0+0x56/0xb0
[ 174.898468][ T9784] __kmalloc_noprof+0x2b0/0x550
[ 174.898896][ T9784] usb_alloc_urb+0x73/0xa0
[ 174.899289][ T9784] usb_control_msg+0x1cb/0x4a0
[ 174.899718][ T9784] usb_get_string+0xab/0x1a0
[ 174.900133][ T9784] usb_string_sub+0x107/0x3c0
[ 174.900549][ T9784] usb_string+0x307/0x670
[ 174.900933][ T9784] usb_cache_string+0x80/0x150
[ 174.901355][ T9784] usb_new_device+0x1d0/0x19d0
[ 174.901786][ T9784] register_root_hub+0x299/0x730
[ 174.902231][ T9784] page last free pid 10 tgid 10 stack trace:
[ 174.902757][ T9784] __free_frozen_pages+0x80c/0x1250
[ 174.903217][ T9784] vfree.part.0+0x12b/0xab0
[ 174.903645][ T9784] delayed_vfree_work+0x93/0xd0
[ 174.904073][ T9784] process_one_work+0x9b5/0x1b80
[ 174.904519][ T9784] worker_thread+0x630/0xe60
[ 174.904927][ T9784] kthread+0x3a8/0x770
[ 174.905291][ T9784] ret_from_fork+0x517/0x6e0
[ 174.905709][ T9784] ret_from_fork_asm+0x1a/0x30
[ 174.906128][ T9784]
[ 174.906338][ T9784] Memory state around the buggy address:
[ 174.906828][ T9784] ffff88810b5fc580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 174.907528][ T9784] ffff88810b5fc600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 174.908222][ T9784] >ffff88810b5fc680: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 174.908917][ T9784] ^
[ 174.909481][ T9784] ffff88810b5fc700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 174.910432][ T9784] ffff88810b5fc780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 174.911401][ T9784] ==================================================================
The reason of the issue that code doesn't check the correctness
of the requested offset and length. As a result, incorrect value
of offset or/and length could result in access out of allocated
memory.
This patch introduces is_bnode_offset_valid() method that checks
the requested offset value. Also, it introduces
check_and_correct_requested_length() method that checks and
correct the requested length (if it is necessary). These methods
are used in hfsplus_bnode_read(), hfsplus_bnode_write(),
hfsplus_bnode_clear(), hfsplus_bnode_copy(), and hfsplus_bnode_move()
with the goal to prevent the access out of allocated memory
and triggering the crash.
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Reported-by: Shuoran Bai <baishuoran@hrbeu.edu.cn>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250703214804.244077-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
The hfsplus_readdir() method is capable to crash by calling
hfsplus_uni2asc():
[ 667.121659][ T9805] ==================================================================
[ 667.122651][ T9805] BUG: KASAN: slab-out-of-bounds in hfsplus_uni2asc+0x902/0xa10
[ 667.123627][ T9805] Read of size 2 at addr ffff88802592f40c by task repro/9805
[ 667.124578][ T9805]
[ 667.124876][ T9805] CPU: 3 UID: 0 PID: 9805 Comm: repro Not tainted 6.16.0-rc3 #1 PREEMPT(full)
[ 667.124886][ T9805] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 667.124890][ T9805] Call Trace:
[ 667.124893][ T9805] <TASK>
[ 667.124896][ T9805] dump_stack_lvl+0x10e/0x1f0
[ 667.124911][ T9805] print_report+0xd0/0x660
[ 667.124920][ T9805] ? __virt_addr_valid+0x81/0x610
[ 667.124928][ T9805] ? __phys_addr+0xe8/0x180
[ 667.124934][ T9805] ? hfsplus_uni2asc+0x902/0xa10
[ 667.124942][ T9805] kasan_report+0xc6/0x100
[ 667.124950][ T9805] ? hfsplus_uni2asc+0x902/0xa10
[ 667.124959][ T9805] hfsplus_uni2asc+0x902/0xa10
[ 667.124966][ T9805] ? hfsplus_bnode_read+0x14b/0x360
[ 667.124974][ T9805] hfsplus_readdir+0x845/0xfc0
[ 667.124984][ T9805] ? __pfx_hfsplus_readdir+0x10/0x10
[ 667.124994][ T9805] ? stack_trace_save+0x8e/0xc0
[ 667.125008][ T9805] ? iterate_dir+0x18b/0xb20
[ 667.125015][ T9805] ? trace_lock_acquire+0x85/0xd0
[ 667.125022][ T9805] ? lock_acquire+0x30/0x80
[ 667.125029][ T9805] ? iterate_dir+0x18b/0xb20
[ 667.125037][ T9805] ? down_read_killable+0x1ed/0x4c0
[ 667.125044][ T9805] ? putname+0x154/0x1a0
[ 667.125051][ T9805] ? __pfx_down_read_killable+0x10/0x10
[ 667.125058][ T9805] ? apparmor_file_permission+0x239/0x3e0
[ 667.125069][ T9805] iterate_dir+0x296/0xb20
[ 667.125076][ T9805] __x64_sys_getdents64+0x13c/0x2c0
[ 667.125084][ T9805] ? __pfx___x64_sys_getdents64+0x10/0x10
[ 667.125091][ T9805] ? __x64_sys_openat+0x141/0x200
[ 667.125126][ T9805] ? __pfx_filldir64+0x10/0x10
[ 667.125134][ T9805] ? do_user_addr_fault+0x7fe/0x12f0
[ 667.125143][ T9805] do_syscall_64+0xc9/0x480
[ 667.125151][ T9805] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 667.125158][ T9805] RIP: 0033:0x7fa8753b2fc9
[ 667.125164][ T9805] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[ 667.125172][ T9805] RSP: 002b:00007ffe96f8e0f8 EFLAGS: 00000217 ORIG_RAX: 00000000000000d9
[ 667.125181][ T9805] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa8753b2fc9
[ 667.125185][ T9805] RDX: 0000000000000400 RSI: 00002000000063c0 RDI: 0000000000000004
[ 667.125190][ T9805] RBP: 00007ffe96f8e110 R08: 00007ffe96f8e110 R09: 00007ffe96f8e110
[ 667.125195][ T9805] R10: 0000000000000000 R11: 0000000000000217 R12: 0000556b1e3b4260
[ 667.125199][ T9805] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 667.125207][ T9805] </TASK>
[ 667.125210][ T9805]
[ 667.145632][ T9805] Allocated by task 9805:
[ 667.145991][ T9805] kasan_save_stack+0x20/0x40
[ 667.146352][ T9805] kasan_save_track+0x14/0x30
[ 667.146717][ T9805] __kasan_kmalloc+0xaa/0xb0
[ 667.147065][ T9805] __kmalloc_noprof+0x205/0x550
[ 667.147448][ T9805] hfsplus_find_init+0x95/0x1f0
[ 667.147813][ T9805] hfsplus_readdir+0x220/0xfc0
[ 667.148174][ T9805] iterate_dir+0x296/0xb20
[ 667.148549][ T9805] __x64_sys_getdents64+0x13c/0x2c0
[ 667.148937][ T9805] do_syscall_64+0xc9/0x480
[ 667.149291][ T9805] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 667.149809][ T9805]
[ 667.150030][ T9805] The buggy address belongs to the object at ffff88802592f000
[ 667.150030][ T9805] which belongs to the cache kmalloc-2k of size 2048
[ 667.151282][ T9805] The buggy address is located 0 bytes to the right of
[ 667.151282][ T9805] allocated 1036-byte region [ffff88802592f000, ffff88802592f40c)
[ 667.152580][ T9805]
[ 667.152798][ T9805] The buggy address belongs to the physical page:
[ 667.153373][ T9805] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x25928
[ 667.154157][ T9805] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 667.154916][ T9805] anon flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
[ 667.155631][ T9805] page_type: f5(slab)
[ 667.155997][ T9805] raw: 00fff00000000040 ffff88801b442f00 0000000000000000 dead000000000001
[ 667.156770][ T9805] raw: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000
[ 667.157536][ T9805] head: 00fff00000000040 ffff88801b442f00 0000000000000000 dead000000000001
[ 667.158317][ T9805] head: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000
[ 667.159088][ T9805] head: 00fff00000000003 ffffea0000964a01 00000000ffffffff 00000000ffffffff
[ 667.159865][ T9805] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[ 667.160643][ T9805] page dumped because: kasan: bad access detected
[ 667.161216][ T9805] page_owner tracks the page as allocated
[ 667.161732][ T9805] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN9
[ 667.163566][ T9805] post_alloc_hook+0x1c0/0x230
[ 667.164003][ T9805] get_page_from_freelist+0xdeb/0x3b30
[ 667.164503][ T9805] __alloc_frozen_pages_noprof+0x25c/0x2460
[ 667.165040][ T9805] alloc_pages_mpol+0x1fb/0x550
[ 667.165489][ T9805] new_slab+0x23b/0x340
[ 667.165872][ T9805] ___slab_alloc+0xd81/0x1960
[ 667.166313][ T9805] __slab_alloc.isra.0+0x56/0xb0
[ 667.166767][ T9805] __kmalloc_cache_noprof+0x255/0x3e0
[ 667.167255][ T9805] psi_cgroup_alloc+0x52/0x2d0
[ 667.167693][ T9805] cgroup_mkdir+0x694/0x1210
[ 667.168118][ T9805] kernfs_iop_mkdir+0x111/0x190
[ 667.168568][ T9805] vfs_mkdir+0x59b/0x8d0
[ 667.168956][ T9805] do_mkdirat+0x2ed/0x3d0
[ 667.169353][ T9805] __x64_sys_mkdir+0xef/0x140
[ 667.169784][ T9805] do_syscall_64+0xc9/0x480
[ 667.170195][ T9805] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 667.170730][ T9805] page last free pid 1257 tgid 1257 stack trace:
[ 667.171304][ T9805] __free_frozen_pages+0x80c/0x1250
[ 667.171770][ T9805] vfree.part.0+0x12b/0xab0
[ 667.172182][ T9805] delayed_vfree_work+0x93/0xd0
[ 667.172612][ T9805] process_one_work+0x9b5/0x1b80
[ 667.173067][ T9805] worker_thread+0x630/0xe60
[ 667.173486][ T9805] kthread+0x3a8/0x770
[ 667.173857][ T9805] ret_from_fork+0x517/0x6e0
[ 667.174278][ T9805] ret_from_fork_asm+0x1a/0x30
[ 667.174703][ T9805]
[ 667.174917][ T9805] Memory state around the buggy address:
[ 667.175411][ T9805] ffff88802592f300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 667.176114][ T9805] ffff88802592f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 667.176830][ T9805] >ffff88802592f400: 00 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 667.177547][ T9805] ^
[ 667.177933][ T9805] ffff88802592f480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 667.178640][ T9805] ffff88802592f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 667.179350][ T9805] ==================================================================
The hfsplus_uni2asc() method operates by struct hfsplus_unistr:
struct hfsplus_unistr {
__be16 length;
hfsplus_unichr unicode[HFSPLUS_MAX_STRLEN];
} __packed;
where HFSPLUS_MAX_STRLEN is 255 bytes. The issue happens if length
of the structure instance has value bigger than 255 (for example,
65283). In such case, pointer on unicode buffer is going beyond of
the allocated memory.
The patch fixes the issue by checking the length value of
hfsplus_unistr instance and using 255 value in the case if length
value is bigger than HFSPLUS_MAX_STRLEN. Potential reason of such
situation could be a corruption of Catalog File b-tree's node.
Reported-by: Wenzhi Wang <wenzhi.wang@uwaterloo.ca>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Yangtao Li <frank.li@vivo.com>
Link: https://lore.kernel.org/r/20250710230830.110500-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
When the volume header contains erroneous values that do not reflect
the actual state of the filesystem, hfsplus_fill_super() assumes that
the attributes file is not yet created, which later results in hitting
BUG_ON() when hfsplus_create_attributes_file() is called. Replace this
BUG_ON() with -EIO error with a message to suggest running fsck tool.
Reported-by: syzbot <syzbot+1107451c16b9eb9d29e6@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=1107451c16b9eb9d29e6
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/7b587d24-c8a1-4413-9b9a-00a33fbd849f@I-love.SAKURA.ne.jp
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
hfsplus_submit_bio() called by hfsplus_sync_fs() uses bdev_virt_rw() which
in turn uses submit_bio_wait() to submit the BIO.
But submit_bio_wait() already sets the REQ_SYNC flag on the BIO so there
is no need for setting the flag in hfsplus_sync_fs() when calling
hfsplus_submit_bio().
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250710063553.4805-1-johannes.thumshirn@wdc.com
Link: https://lore.kernel.org/r/20250710063553.4805-1-johannes.thumshirn@wdc.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"Two last-minute fixes for this cycle:
- Set afs vllist to NULL if addr parsing fails
- Add a missing check for reaching the end of the string in afs"
* tag 'vfs-6.16-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Set vllist to NULL if addr parsing fails
afs: Fix check for NULL terminator
|
|
Pull bcachefs fixes from Kent Overstreet:
"User reported fixes:
- Fix btree node scan on encrypted filesystems by not using btree
node header fields encrypted
- Fix a race in btree write buffer flush; this caused EROs primarily
during fsck for some people"
* tag 'bcachefs-2025-07-24' of git://evilpiepirate.org/bcachefs:
bcachefs: Add missing snapshots_seen_add_inorder()
bcachefs: Fix write buffer flushing from open journal entry
bcachefs: btree_node_scan: don't re-read before initializing found_btree_node
|
|
A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data()
when an inode had the INLINE_DATA_FL flag set but was missing the
system.data extended attribute.
Since this can happen due to a maiciouly fuzzed file system, we
shouldn't BUG, but rather, report it as a corrupted file system.
Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii
ext4_create_inline_data() and ext4_inline_data_truncate().
Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Although we now perform ordered traversal within an xarray, this is
currently limited to a single xarray. However, we have multiple such
xarrays, which prevents us from guaranteeing a linear-like traversal
where all groups on the right are visited before all groups on the left.
For example, suppose we have 128 block groups, with a target group of 64,
a target length corresponding to an order of 1, and available free groups
of 16 (order 1) and group 65 (order 8):
For linear traversal, when no suitable free block is found in group 64, it
will search in the next block group until group 127, then start searching
from 0 up to block group 63. It ensures continuous forward traversal, which
is consistent with the unidirectional rotation behavior of HDD platters.
Additionally, the block group lock contention during freeing block is
unavoidable. The goal increasing from 0 to 64 indicates that previously
scanned groups (which had no suitable free space and are likely to free
blocks later) and skipped groups (which are currently in use) have newly
freed some used blocks. If we allocate blocks in these groups, the
probability of competing with other processes increases.
For non-linear traversal, we first traverse all groups in order_1. If only
group 16 has free space in this list, we first traverse [63, 128), then
traverse [0, 64) to find the available group 16, and then allocate blocks
in group 16. Therefore, it cannot guarantee continuous traversal in one
direction, thus increasing the probability of contention.
So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range()
to only traverse a fixed range of groups, and move the logic for handling
wrap around to the caller. The caller first iterates through all xarrays
in the range [start, ngroups) and then through the range [0, start). This
approach simulates a linear scan, which reduces contention between freeing
blocks and allocating blocks.
Assume we have the following groups, where "|" denotes the xarray traversal
start position:
order_1_groups: AB | CD
order_2_groups: EF | GH
Traversal order:
Before: C > D > A > B > G > H > E > F
After: C > D > G > H > A > B > E > F
Performance test data follows:
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19555 | 20049 (+2.5%) | 315636 | 316724 (-0.3%) |
|mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53192 | 52125 (-2.0%) | 212678 | 215136 (+1.1%) |
|mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.
This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.
Key changes:
* ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups().
* Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear
traversals, and related functions now return error codes instead of
group info.
* Added ext4_mb_scan_groups_linear() for performing linear scans starting
from a specific group for a set number of times.
* Linear scans now execute up to sbi->s_mb_max_linear_groups times,
so ac_groups_linear_remaining is removed as it's no longer used.
* ac->ac_criteria is now used directly instead of passing cr around.
Also, ac->ac_criteria is incremented directly after groups scan fails
for the corresponding criteria.
* Since we're now directly scanning groups instead of finding a good group
then scanning, the following variables and flags are no longer needed,
s_bal_cX_groups_considered is sufficient.
s_bal_p2_aligned_bad_suggestions
s_bal_goal_fast_bad_suggestions
s_bal_best_avail_bad_suggestions
EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED
EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED
EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_try_lock_group impossible. This can lead to a bouncing
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.
In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.
Therefore, this commit converts both average fragment size lists and
largest free order lists into ordered xarrays.
In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.
While insertion and deletion complexity remain O(1), lookup complexity
changes from O(1) to O(nlogn), which may slightly reduce single-threaded
performance.
Additionally, xarray insertions might fail, potentially due to memory
allocation issues. However, since we have linear traversal as a fallback,
this isn't a major problem. Therefore, we've only added a warning message
for insertion failures here.
A helper function ext4_mb_find_good_group_xarray() is added to find good
groups in the specified xarray starting at the specified position start,
and when it reaches ngroups-1, it wraps around to 0 and then to start-1.
This ensures an ordered traversal within the xarray.
Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20097 | 19555 (-2.6%) | 316141 | 315636 (-0.2%) |
|mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53603 | 53192 (-0.7%) | 214243 | 212678 (-0.7%) |
|mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
[ Applied spelling fixes per discussion on the ext4-list see thread
referened in the Link tag. --tytso]
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Extract ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Extract ext4_mb_might_prefetch() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.
For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.
At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.
To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated(i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.
This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.
Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Attempt to merge ext4_free_data with already inserted free extents prior
to adding new ones. This strategy drastically cuts down the number of
times locks are held.
For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:
prev merge into new and free prev // hold lock
next merge into new and free next // hold lock
insert new // hold lock
After the patch, it only needs to be acquired once:
new merge into next and free new // no lock
next merge into prev and free next // hold lock
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%) | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290 | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%) | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock just
guarantees the atomicity of s_mb_free_pending. Thus we optimized it by
converting s_mb_free_pending into an atomic variable, thereby eliminating
s_md_lock and minimizing lock contention. This also prepares for future
lockless merging of free extents.
Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%) | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129 | 7290 (+2.2%) | 321275 | 324226 (+0.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%) | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%) | 215262 | 216276 (+0.4%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove the superfluous "find_".
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.
However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.
To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.
To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636 | 19628 (+103%) | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834 | 7129 (+47.4%) | 341440 | 321275 (-5.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%) | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177 | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.
Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821 | 9636 (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784 | 4834 (+1.04%) | 316344 | 341440 (+7.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101 | 9177 (+50.4%) | 207373 | 215732 (+4.0%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
by default, so the no longer needed sbi->s_mb_last_start is removed.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
fails to achieve the inode goal, allocation continues with the stream
allocation global goal. Currently, hits for both are combined in
sbi->s_bal_goals, hindering accurate optimization.
This commit separates global goal hits into sbi->s_bal_stream_goals. Since
stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
This prevents stream allocations from being counted in s_bal_goals. Also
clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.
After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:
mballoc:
reqs: 840347
success: 750992
groups_scanned: 1230506
cr_p2_aligned_stats:
hits: 21531
groups_considered: 411664
extents_scanned: 21531
useless_loops: 0
bad_suggestions: 6
cr_goal_fast_stats:
hits: 111222
groups_considered: 1806728
extents_scanned: 467908
useless_loops: 0
bad_suggestions: 13
cr_best_avail_stats:
hits: 36267
groups_considered: 1817631
extents_scanned: 156143
useless_loops: 0
bad_suggestions: 204
cr_goal_slow_stats:
hits: 106396
groups_considered: 5671710
extents_scanned: 22540056
useless_loops: 123747
cr_any_free_stats:
hits: 138071
groups_considered: 724692
extents_scanned: 23615593
useless_loops: 585
extents_scanned: 46804261
goal_hits: 1307
stream_goal_hits: 236317
len_goal_hits: 155549
2^n_hits: 21531
breaks: 225096
lost: 35062
buddies_generated: 40/40
buddies_time_used: 48004
preallocated: 5962467
discarded: 4847560
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|