Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Add page -> folio conversions (Joanne Koong, Josef Bacik)
- Allow max size of fuse requests to be configurable with a sysctl
(Joanne Koong)
- Allow FOPEN_DIRECT_IO to take advantage of async code path (yangyun)
- Fix large kernel reads (like a module load) in virtio_fs (Hou Tao)
- Fix attribute inconsistency in case readdirplus (and plain lookup in
corner cases) is racing with inode eviction (Zhang Tianci)
- Fix a WARN_ON triggered by virtio_fs (Asahi Lina)
* tag 'fuse-update-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (30 commits)
virtiofs: dax: remove ->writepages() callback
fuse: check attributes staleness on fuse_iget()
fuse: remove pages for requests and exclusively use folios
fuse: convert direct io to use folios
mm/writeback: add folio_mark_dirty_lock()
fuse: convert writebacks to use folios
fuse: convert retrieves to use folios
fuse: convert ioctls to use folios
fuse: convert writes (non-writeback) to use folios
fuse: convert reads to use folios
fuse: convert readdir to use folios
fuse: convert readlink to use folios
fuse: convert cuse to use folios
fuse: add support in virtio for requests using folios
fuse: support folios in struct fuse_args_pages and fuse_copy_pages()
fuse: convert fuse_notify_store to use folios
fuse: convert fuse_retrieve to use folios
fuse: use the folio based vmstat helpers
fuse: convert fuse_writepage_need_send to take a folio
fuse: convert fuse_do_readpage to use folios
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix the code that cleans up left-over unlinked files.
Various fixes and minor improvements in deleting files cached or held
open remotely.
- Simplify the use of dlm's DLM_LKF_QUECVT flag.
- A few other minor cleanups.
* tag 'gfs2-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (21 commits)
gfs2: Prevent inode creation race
gfs2: Only defer deletes when we have an iopen glock
gfs2: Simplify DLM_LKF_QUECVT use
gfs2: gfs2_evict_inode clarification
gfs2: Make gfs2_inode_refresh static
gfs2: Use get_random_u32 in gfs2_orlov_skip
gfs2: Randomize GLF_VERIFY_DELETE work delay
gfs2: Use mod_delayed_work in gfs2_queue_try_to_evict
gfs2: Update to the evict / remote delete documentation
gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode
gfs2: Clean up delete work processing
gfs2: Minor delete_work_func cleanup
gfs2: Return enum evict_behavior from gfs2_upgrade_iopen_glock
gfs2: Rename dinode_demise to evict_behavior
gfs2: Rename GIF_{DEFERRED -> DEFER}_DELETE
gfs2: Faster gfs2_upgrade_iopen_glock wakeups
KMSAN: uninit-value in inode_go_dump (5)
gfs2: Fix unlinked inode cleanup
gfs2: Allow immediate GLF_VERIFY_DELETE work
gfs2: Initialize gl_no_formal_ino earlier
...
|
|
Bring in an overlayfs fix for v6.13-rc1 that fixes a bug introduced by
the overlayfs changes merged for v6.13.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Commit 48b50624aec4 ("backing-file: clean up the API") unintentionally
changed the argument in the ->accessed() callback from the user file to
the backing file.
Fixes: 48b50624aec4 ("backing-file: clean up the API")
Reported-by: syzbot+8d1206605b05ca9a0e6a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-unionfs/67447b3c.050a0220.1cc393.0085.GAE@google.com/
Tested-by: syzbot+8d1206605b05ca9a0e6a@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20241126145342.364869-1-amir73il@gmail.com
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A race condition exists between SMB request handling in
`ksmbd_conn_handler_loop()` and the freeing of `ksmbd_conn` in the
workqueue handler `handle_ksmbd_work()`. This leads to a UAF.
- KASAN: slab-use-after-free Read in handle_ksmbd_work
- KASAN: slab-use-after-free in rtlock_slowlock_locked
This race condition arises as follows:
- `ksmbd_conn_handler_loop()` waits for `conn->r_count` to reach zero:
`wait_event(conn->r_count_q, atomic_read(&conn->r_count) == 0);`
- Meanwhile, `handle_ksmbd_work()` decrements `conn->r_count` using
`atomic_dec_return(&conn->r_count)`, and if it reaches zero, calls
`ksmbd_conn_free()`, which frees `conn`.
- However, after `handle_ksmbd_work()` decrements `conn->r_count`,
it may still access `conn->r_count_q` in the following line:
`waitqueue_active(&conn->r_count_q)` or `wake_up(&conn->r_count_q)`
This results in a UAF, as `conn` has already been freed.
The discovery of this UAF can be referenced in the following PR for
syzkaller's support for SMB requests.
Link: https://github.com/google/syzkaller/pull/5524
Fixes: ee426bfb9d09 ("ksmbd: add refcnt to ksmbd_conn struct")
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org # v6.6.55+, v6.10.14+, v6.11.3+
Cc: syzkaller@googlegroups.com
Signed-off-by: Yunseong Kim <yskelg@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We need to know how many pending requests are left at the end of server
shutdown. That means we need to know how long the server will wait
to process pending requests in case of a server shutdown.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add netdev-up/down event debug print to find what netdev is connected or
disconnected.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add debug prints to know what smb2 requests were received.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add debug print to know if netdevice is RDMA-capable network adapter.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
use msleep instaed of schedule_timeout_interruptible()
to guarantee the task delays as expected.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Prefer to report ENOMEM rather than incur the oom for allocations in
ksmbd. __GFP_NORETRY could not achieve that, It would fail the allocations
just too easily. __GFP_RETRY_MAYFAIL will keep retrying the allocation
until there is no more progress and fail the allocation instead go OOM
and let the caller to deal with it.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code
- The series "Improve the copy of task comm" from Yafang Shao addresses
possible race-induced overflows in the management of
task_struct.comm[]
- The series "Remove unnecessary header includes from
{tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
small fix to the list_sort library code and to its selftest
- The series "Enhance min heap API with non-inline functions and
optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
min_heap library code
- The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
finishes off nilfs2's folioification
- The series "add detect count for hung tasks" from Lance Yang adds
more userspace visibility into the hung-task detector's activity
- Apart from that, singelton patches in many places - please see the
individual changelogs for details
* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
gdb: lx-symbols: do not error out on monolithic build
kernel/reboot: replace sprintf() with sysfs_emit()
lib: util_macros_kunit: add kunit test for util_macros.h
util_macros.h: fix/rework find_closest() macros
Improve consistency of '#error' directive messages
ocfs2: fix uninitialized value in ocfs2_file_read_iter()
hung_task: add docs for hung_task_detect_count
hung_task: add detect count for hung tasks
dma-buf: use atomic64_inc_return() in dma_buf_getfile()
fs/proc/kcore.c: fix coccinelle reported ERROR instances
resource: avoid unnecessary resource tree walking in __region_intersects()
ocfs2: remove unused errmsg function and table
ocfs2: cluster: fix a typo
lib/scatterlist: use sg_phys() helper
checkpatch: always parse orig_commit in fixes tag
nilfs2: convert metadata aops from writepage to writepages
nilfs2: convert nilfs_recovery_copy_block() to take a folio
nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
nilfs2: remove nilfs_writepage
nilfs2: convert checkpoint file to be folio-based
...
|
|
SMB1 NT_TRANSACT_IOCTL/FSCTL_GET_REPARSE_POINT even in non-UNICODE mode
returns reparse buffer in UNICODE/UTF-16 format.
This is because FSCTL_GET_REPARSE_POINT is NT-based IOCTL which does not
distinguish between 8-bit non-UNICODE and 16-bit UNICODE modes and its path
buffers are always encoded in UTF-16.
This change fixes reading of native symlinks in SMB1 when UNICODE session
is not active.
Fixes: ed3e0a149b58 ("smb: client: implement ->query_reparse_point() for SMB1")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
WSL socket, fifo, char and block devices have empty reparse buffer.
Validate the length of the reparse buffer.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
$LXDEV xattr is for storing block/char device's major and minor number.
Change guard which excludes storing $LXDEV xattr to explicitly filter
everything except block and char device. Current guard is opposite, which
is currently correct but is less-safe. This change is required for adding
support for creating WSL-style symlinks as symlinks also do not use
device's major and minor numbers.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Linux CIFS client currently does not implement readlink() for WSL-style
symlinks. It is only able to detect that file is of WSL-style symlink, but
is not able to read target symlink location.
Add this missing functionality and implement support for parsing content of
WSL-style symlink.
The important note is that symlink target location stored for WSL symlink
reparse point (IO_REPARSE_TAG_LX_SYMLINK) is in UTF-8 encoding instead of
UTF-16 (which is used in whole SMB protocol and also in all other symlink
styles). So for proper locale/cp support it is needed to do conversion from
UTF-8 to local_nls.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Check that path buffer has correct length (it is non-zero and in UNICODE
mode it has even number of bytes) and check that buffer does not contain
null character (UTF-16 null codepoint in UNICODE mode or null byte in
non-unicode mode) because Linux cannot process symlink with null byte.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
SMB symlink which has SYMLINK_FLAG_RELATIVE set is relative (as opposite of
the absolute) and it can be relative either to the current directory (where
is the symlink stored) or relative to the top level export path. To what it
is relative depends on the first character of the symlink target path.
If the first character is path separator then symlink is relative to the
export, otherwise to the current directory. Linux (and generally POSIX
systems) supports only symlink paths relative to the current directory
where is symlink stored.
Currently if Linux SMB client reads relative SMB symlink with first
character as path separator (slash), it let as is. Which means that Linux
interpret it as absolute symlink pointing from the root (/). But this
location is different than the top level directory of SMB export (unless
SMB export was mounted to the root) and thefore SMB symlinks relative to
the export are interpreted wrongly by Linux SMB client.
Fix this problem. As Linux does not have equivalent of the path relative to
the top of the mount point, convert such symlink target path relative to
the current directory. Do this by prepending "../" pattern N times before
the SMB target path, where N is the number of path separators found in SMB
symlink path.
So for example, if SMB share is mounted to Linux path /mnt/share/, symlink
is stored in file /mnt/share/test/folder1/symlink (so SMB symlink path is
test\folder1\symlink) and SMB symlink target points to \test\folder2\file,
then convert symlink target path to Linux path ../../test/folder2/file.
Deduplicate code for parsing SMB symlinks in native form from functions
smb2_parse_symlink_response() and parse_reparse_native_symlink() into new
function smb2_parse_native_symlink() and pass into this new function a new
full_path parameter from callers, which specify SMB full path where is
symlink stored.
This change fixes resolving of the native Windows symlinks relative to the
top level directory of the SMB share.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Neither SMB3.0 or SMB3.02 supports encryption negotiate context, so
when SMB2_GLOBAL_CAP_ENCRYPTION flag is set in the negotiate response,
the client uses AES-128-CCM as the default cipher. See MS-SMB2
3.3.5.4.
Commit b0abcd65ec54 ("smb: client: fix UAF in async decryption") added
a @server->cipher_type check to conditionally call
smb3_crypto_aead_allocate(), but that check would always be false as
@server->cipher_type is unset for SMB3.02.
Fix the following KASAN splat by setting @server->cipher_type for
SMB3.02 as well.
mount.cifs //srv/share /mnt -o vers=3.02,seal,...
BUG: KASAN: null-ptr-deref in crypto_aead_setkey+0x2c/0x130
Read of size 8 at addr 0000000000000020 by task mount.cifs/1095
CPU: 1 UID: 0 PID: 1095 Comm: mount.cifs Not tainted 6.12.0 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-3.fc41
04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x5d/0x80
? crypto_aead_setkey+0x2c/0x130
kasan_report+0xda/0x110
? crypto_aead_setkey+0x2c/0x130
crypto_aead_setkey+0x2c/0x130
crypt_message+0x258/0xec0 [cifs]
? __asan_memset+0x23/0x50
? __pfx_crypt_message+0x10/0x10 [cifs]
? mark_lock+0xb0/0x6a0
? hlock_class+0x32/0xb0
? mark_lock+0xb0/0x6a0
smb3_init_transform_rq+0x352/0x3f0 [cifs]
? lock_acquire.part.0+0xf4/0x2a0
smb_send_rqst+0x144/0x230 [cifs]
? __pfx_smb_send_rqst+0x10/0x10 [cifs]
? hlock_class+0x32/0xb0
? smb2_setup_request+0x225/0x3a0 [cifs]
? __pfx_cifs_compound_last_callback+0x10/0x10 [cifs]
compound_send_recv+0x59b/0x1140 [cifs]
? __pfx_compound_send_recv+0x10/0x10 [cifs]
? __create_object+0x5e/0x90
? hlock_class+0x32/0xb0
? do_raw_spin_unlock+0x9a/0xf0
cifs_send_recv+0x23/0x30 [cifs]
SMB2_tcon+0x3ec/0xb30 [cifs]
? __pfx_SMB2_tcon+0x10/0x10 [cifs]
? lock_acquire.part.0+0xf4/0x2a0
? __pfx_lock_release+0x10/0x10
? do_raw_spin_trylock+0xc6/0x120
? lock_acquire+0x3f/0x90
? _get_xid+0x16/0xd0 [cifs]
? __pfx_SMB2_tcon+0x10/0x10 [cifs]
? cifs_get_smb_ses+0xcdd/0x10a0 [cifs]
cifs_get_smb_ses+0xcdd/0x10a0 [cifs]
? __pfx_cifs_get_smb_ses+0x10/0x10 [cifs]
? cifs_get_tcp_session+0xaa0/0xca0 [cifs]
cifs_mount_get_session+0x8a/0x210 [cifs]
dfs_mount_share+0x1b0/0x11d0 [cifs]
? __pfx___lock_acquire+0x10/0x10
? __pfx_dfs_mount_share+0x10/0x10 [cifs]
? lock_acquire.part.0+0xf4/0x2a0
? find_held_lock+0x8a/0xa0
? hlock_class+0x32/0xb0
? lock_release+0x203/0x5d0
cifs_mount+0xb3/0x3d0 [cifs]
? do_raw_spin_trylock+0xc6/0x120
? __pfx_cifs_mount+0x10/0x10 [cifs]
? lock_acquire+0x3f/0x90
? find_nls+0x16/0xa0
? smb3_update_mnt_flags+0x372/0x3b0 [cifs]
cifs_smb3_do_mount+0x1e2/0xc80 [cifs]
? __pfx_vfs_parse_fs_string+0x10/0x10
? __pfx_cifs_smb3_do_mount+0x10/0x10 [cifs]
smb3_get_tree+0x1bf/0x330 [cifs]
vfs_get_tree+0x4a/0x160
path_mount+0x3c1/0xfb0
? kasan_quarantine_put+0xc7/0x1d0
? __pfx_path_mount+0x10/0x10
? kmem_cache_free+0x118/0x3e0
? user_path_at+0x74/0xa0
__x64_sys_mount+0x1a6/0x1e0
? __pfx___x64_sys_mount+0x10/0x10
? mark_held_locks+0x1a/0x90
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Cc: Tom Talpey <tom@talpey.com>
Reported-by: Jianhong Yin <jiyin@redhat.com>
Cc: stable@vger.kernel.org # v6.12
Fixes: b0abcd65ec54 ("smb: client: fix UAF in async decryption")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since commit 8da33fd11c05 ("cifs: avoid deadlocks while updating iface")
cifs_chan_update_iface now takes the chan_lock itself, so update the
comment accordingly.
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Change return value from -ENOENT to -EOPNOTSUPP to maintain consistency
with the return value of open_cached_dir() for the same case. This
change is safe as the only calling function does not differentiate
between these return values.
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Setting dir_cache_timeout to zero should disable the caching of
directory contents. Currently, even when dir_cache_timeout is zero,
some caching related functions are still invoked, which is unintended
behavior.
Fix the issue by setting tcon->nohandlecache to true when
dir_cache_timeout is zero, ensuring that directory handle caching
is properly disabled.
Fixes: 238b351d0935 ("smb3: allow controlling length of time directory entries are cached with dir leases")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Checks inside open_cached_dir() can be removed because if dir caching is
disabled then tcon->cfids is necessarily NULL. Therefore, all other checks
are redundant.
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
get_current_cred() increments the reference counter, but the
put_cred() call was missing.
Cc: stable@vger.kernel.org
Fixes: 596afb0b8933 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
This eliminates a redundant get_current_cred() call, because
ceph_mds_check_access() has already obtained this pointer.
As a side effect, this also fixes a reference leak in
ceph_mds_auth_match(): by omitting the get_current_cred() call, no
additional cred reference is taken.
Cc: stable@vger.kernel.org
Fixes: 596afb0b8933 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
F_SET_RW_HINT controls data placement in the file system and / or
device and should not be available to everyone who can read a given file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241122122931.90408-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Before this commit, ->dir and ->entry of exfat_inode_info record the
first cluster of the parent directory and the directory entry index
starting from this cluster.
The directory entry set will be gotten during write-back-inode/rmdir/
unlink/rename. If the clusters of the parent directory are not
continuous, the FAT chain will be traversed from the first cluster of
the parent directory to find the cluster where ->entry is located.
After this commit, ->dir records the cluster where the first directory
entry in the directory entry set is located, and ->entry records the
directory entry index in the cluster, so that there is almost no need
to access the FAT when getting the directory entry set.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
For the root directory and other directories, the clusters
allocated to them can be obtained from exfat_inode_info, and
there is no need to distinguish them.
And there is no need to initialize atime/ctime/mtime/size in
exfat_readdir(), because exfat_iterate() does not use them.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
The output of argument 'p_dir' of exfat_add_entry() is not used
in either exfat_mkdir() or exfat_create(), remove the argument.
Code refinement, no functional changes.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
__exfat_resolve_path() mixes two functions. The first one is to
resolve and check if the path is valid. The second one is to output
the cluster assigned to the directory.
The second one is only needed when need to traverse the directory
entries, and calling exfat_chain_set() so early causes p_dir to be
passed as an argument multiple times, increasing the complexity of
the code.
This commit moves the call to exfat_chain_set() before traversing
directory entries.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
This helper gets the directory entry set of the file for the exfat
inode which has been created.
It's used to remove all the instances of the pattern it replaces
making the code cleaner, it's also a preparation for changing ->dir
to record the cluster where the directory entry set is located and
changing ->entry to record the index of the directory entry within
the cluster.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
In this exfat implementation, the relationship between inode and ei
is ei=EXFAT_I(inode). However, in the arguments of exfat_move_file()
and exfat_rename_file(), argument 'inode' indicates the parent
directory, but argument 'ei' indicates the target file to be renamed.
They do not have the above relationship, which is not friendly to code
readers.
So this commit renames 'inode' to 'parent_inode', making the argument
name match its role.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
To determine whether it is a directory, there is no need to read its
directory entry, just use S_ISDIR(inode->i_mode).
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
Unaligned direct writes are invalid and should return an error
without making any changes, rather than extending ->valid_size
and then returning an error. Therefore, alignment checking is
required before extending ->valid_size.
Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Co-developed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
There is no check if stream size and start_clu are invalid.
If start_clu is EOF cluster and stream size is 4096, It will
cause uninit value access. because ei->hint_femp.eidx could
be 128(if cluster size is 4K) and wrong hint will allocate
next cluster. and this cluster will be same with the cluster
that is allocated by exfat_extend_valid_size(). The previous
patch will check invalid start_clu, but for clarity, initialize
hint_femp.eidx to zero.
Cc: stable@vger.kernel.org
Reported-by: syzbot+01218003be74b5e1213a@syzkaller.appspotmail.com
Tested-by: syzbot+01218003be74b5e1213a@syzkaller.appspotmail.com
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
In the case of the directory size is greater than or equal to
the cluster size, if start_clu becomes an EOF cluster(an invalid
cluster) due to file system corruption, then the directory entry
where ei->hint_femp.eidx hint is outside the directory, resulting
in an out-of-bounds access, which may cause further file system
corruption.
This commit adds a check for start_clu, if it is an invalid cluster,
the file or directory will be treated as empty.
Cc: stable@vger.kernel.org
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Co-developed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection
algorithm. This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping
code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of
shadow entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in
the hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page
into small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to
do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio
size rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel
Butt removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations"
from Mike Rapoport teaches x86 to use large pages for
read-only-execute module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
tests over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a
single VMA, rather than requiring that multiple VMAs be created for
this. Improved efficiencies for userspace memory allocators are
expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP
from the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep
is enabled.
* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
zram: fix NULL pointer in comp_algorithm_show()
memcg/hugetlb: add hugeTLB counters to memcg
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
zram: ZRAM_DEF_COMP should depend on ZRAM
MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
Docs/mm/damon: recommend academic papers to read and/or cite
mm: define general function pXd_init()
kmemleak: iommu/iova: fix transient kmemleak false positive
mm/list_lru: simplify the list_lru walk callback function
mm/list_lru: split the lock to per-cgroup scope
mm/list_lru: simplify reparenting and initial allocation
mm/list_lru: code clean up for reparenting
mm/list_lru: don't export list_lru_add
mm/list_lru: don't pass unnecessary key parameters
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
...
|
|
Piergiorgio reported a bug in bugzilla as below:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 969 at fs/f2fs/segment.c:1330
RIP: 0010:__submit_discard_cmd+0x27d/0x400 [f2fs]
Call Trace:
__issue_discard_cmd+0x1ca/0x350 [f2fs]
issue_discard_thread+0x191/0x480 [f2fs]
kthread+0xcf/0x100
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1a/0x30
w/ below testcase, it can reproduce this bug quickly:
- pvcreate /dev/vdb
- vgcreate myvg1 /dev/vdb
- lvcreate -L 1024m -n mylv1 myvg1
- mount /dev/myvg1/mylv1 /mnt/f2fs
- dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=20
- sync
- rm /mnt/f2fs/file
- sync
- lvcreate -L 1024m -s -n mylv1-snapshot /dev/myvg1/mylv1
- umount /mnt/f2fs
The root cause is: it will update discard_max_bytes of mounted lvm
device to zero after creating snapshot on this lvm device, then,
__submit_discard_cmd() will pass parameter @nr_sects w/ zero value
to __blkdev_issue_discard(), it returns a NULL bio pointer, result
in panic.
This patch changes as below for fixing:
1. Let's drop all remained discards in f2fs_unfreeze() if snapshot
of lvm device is created.
2. Checking discard_max_bytes before submitting discard during
__submit_discard_cmd().
Cc: stable@vger.kernel.org
Fixes: 35ec7d574884 ("f2fs: split discard command in prior to block layer")
Reported-by: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219484
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Quoted:
"at this time, there are still 1086911 extent nodes in this zombie
extent tree that need to be cleaned up.
crash_arm64_sprd_v8.0.3++> extent_tree.node_cnt ffffff80896cc500
node_cnt = {
counter = 1086911
},
"
As reported by Xiuhong, there will be a huge number of extent nodes
in extent tree, it may potentially cause:
- slab memory fragments
- extreme long time shrink on extent tree
- low mapping efficiency
Let's add a sysfs node to limit max read extent count for each inode,
by default, value of this threshold is 10240, it can be updated
according to user's requirement.
Reported-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Closes: https://lore.kernel.org/linux-f2fs-devel/20241112110627.1314632-1-xiuhong.wang@unisoc.com/
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- Fix two SMB3.1.1 POSIX Extensions problems
- Fixes for special file handling (symlinks and FIFOs)
- Improve compounding
- Four cleanup patches
- Fix use after free in signing
- Add support for handling namespaces for reconnect related upcalls
(e.g. for DNS names resolution and auth)
- Fix various directory lease problems (directory entry caching),
including some important potential use after frees
* tag '6.13-rc-part1-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: prevent use-after-free due to open_cached_dir error paths
smb: Don't leak cfid when reconnect races with open_cached_dir
smb: client: handle max length for SMB symlinks
smb: client: get rid of bounds check in SMB2_ioctl_init()
smb: client: improve compound padding in encryption
smb3: request handle caching when caching directories
cifs: Recognize SFU char/block devices created by Windows NFS server on Windows Server <<2012
CIFS: New mount option for cifs.upcall namespace resolution
smb/client: Prevent error pointer dereference
fs/smb/client: implement chmod() for SMB3 POSIX Extensions
smb: cached directories can be more than root file handle
smb: client: fix use-after-free of signing key
smb: client: Use str_yes_no() helper function
smb: client: memcpy() with surrounding object base address
cifs: Remove pre-historic unused CIFSSMBCopy
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs updates from Amir Goldstein:
- Fix a syzbot reported NULL pointer deref with bfs lower layers
- Fix a copy up failure of large file from lower fuse fs
- Followup cleanup of backing_file API from Miklos
- Introduction and use of revert/override_creds_light() helpers, that
were suggested by Christian as a mitigation to cache line bouncing
and false sharing of fields in overlayfs creator_cred long lived
struct cred copy.
- Store up to two backing file references (upper and lower) in an
ovl_file container instead of storing a single backing file in
file->private_data.
This is used to avoid the practice of opening a short lived backing
file for the duration of some file operations and to avoid the
specialized use of FDPUT_FPUT in such occasions, that was getting in
the way of Al's fd_file() conversions.
* tag 'ovl-update-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: Filter invalid inodes with missing lookup function
ovl: convert ovl_real_fdget() callers to ovl_real_file()
ovl: convert ovl_real_fdget_path() callers to ovl_real_file_path()
ovl: store upper real file in ovl_file struct
ovl: allocate a container struct ovl_file for ovl private context
ovl: do not open non-data lower file for fsync
ovl: Optimize override/revert creds
ovl: pass an explicit reference of creators creds to callers
ovl: use wrapper ovl_revert_creds()
fs/backing-file: Convert to revert/override_creds_light()
cred: Add a light version of override/revert_creds()
backing-file: clean up the API
ovl: properly handle large files in ovl_security_fileattr
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode updates from Gabriel Krisman Bertazi:
- constify a read-only struct (Thomas Weißschuh)
- fix the error path of unicode_load, avoiding a possible kernel oops
if it fails to find the unicode module (André Almeida)
- documentation fix, updating a filename in the README (Gan Jie)
- add the link of my tree to MAINTAINERS (André Almeida)
* tag 'unicode-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
MAINTAINERS: Add Unicode tree
unicode: change the reference of database file
unicode: Fix utf8_load() error path
unicode: constify utf8 data table
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
"sysctl ctl_table constification:
- Constifying ctl_table structs prevents the modification of
proc_handler function pointers. All ctl_table struct arguments are
const qualified in the sysctl API in such a way that the ctl_table
arrays being defined elsewhere and passed through sysctl can be
constified one-by-one.
We kick the constification off by qualifying user_table in
kernel/ucount.c and expect all the ctl_tables to be constified in
the coming releases.
Misc fixes:
- Adjust comments in two places to better reflect the code
- Remove superfluous dput calls
- Remove Luis from sysctl maintainership
- Replace comments about holding a lock with calls to
lockdep_assert_held"
* tag 'sysctl-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: Reduce dput(child) calls in proc_sys_fill_cache()
sysctl: Reorganize kerneldoc parameter names
ucounts: constify sysctl table user_table
sysctl: update comments to new registration APIs
MAINTAINERS: remove me from sysctl
sysctl: Convert locking comments to lockdep assertions
const_structs.checkpatch: add ctl_table
sysctl: make internal ctl_tables const
sysctl: allow registration of const struct ctl_table
sysctl: move internal interfaces to const struct ctl_table
bpf: Constify ctl_table argument of filter function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl updates from Dave Jiang:
- Constify range_contains() input parameters to prevent changes
- Add support for displaying RCD capabilities in sysfs to support lspci
for CXL device
- Downgrade warning message to debug in cxl_probe_component_regs()
- Add support for adding a printf specifier '%pra' to emit 'struct
range' content:
- Add sanity tests for 'struct resource'
- Add documentation for special case
- Add %pra for 'struct range'
- Add %pra usage in CXL code
- Add preparation code for DCD support:
- Add range_overlaps()
- Add CDAT DSMAS table shared and read only flag in ACPICA
- Add documentation to 'struct dev_dax_range'
- Delay event buffer allocation in CXL PCI code until needed
- Use guard() in cxl_dpa_set_mode()
- Refactor create region code to consolidate common code
* tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Refactor common create region code
cxl/hdm: Use guard() in cxl_dpa_set_mode()
cxl/pci: Delay event buffer allocation
dax: Document struct dev_dax_range
ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
range: Add range_overlaps()
cxl/cdat: Use %pra for dpa range outputs
printf: Add print format (%pra) for struct range
Documentation/printf: struct resource add start == end special case
test printf: Add very basic struct resource tests
cxl: downgrade a warning message to debug level in cxl_probe_component_regs()
cxl/pci: Add sysfs attribute for CXL 1.1 device link status
cxl/core/regs: Add rcd_pcie_cap initialization
kernel/range: Const-ify range_contains parameters
|
|
iov_iter_zero
If iov_iter_zero succeeds after failed copy_from_kernel_nofault,
we need to reset the ret value to zero otherwise it will be returned
as final return value of read_kcore_iter.
This fixes objdump -d dump over /proc/kcore for me.
Cc: stable@vger.kernel.org
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Fixes: 3d5854d75e31 ("fs/proc/kcore.c: allow translation of physical memory addresses")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20241121231118.3212000-1-jolsa@kernel.org
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
I recently had an fstests hang where there were two internal tasks
stuck like so:
[ 6559.010870] task:kworker/24:45 state:D stack:12152 pid:631308 tgid:631308 ppid:2 flags:0x00004000
[ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work
[ 6559.020349] Call Trace:
[ 6559.022002] <TASK>
[ 6559.023426] __schedule+0x650/0xb10
[ 6559.025734] schedule+0x6d/0xf0
[ 6559.027835] schedule_timeout+0x31/0x180
[ 6559.030582] wait_for_common+0x10c/0x1e0
[ 6559.033495] wait_for_completion+0x1d/0x30
[ 6559.036463] __flush_workqueue+0xeb/0x490
[ 6559.039479] ? mempool_alloc_slab+0x15/0x20
[ 6559.042537] xlog_cil_force_seq+0xa1/0x2f0
[ 6559.045498] ? bio_alloc_bioset+0x1d8/0x510
[ 6559.048578] ? submit_bio_noacct+0x2f2/0x380
[ 6559.051665] ? xlog_force_shutdown+0x3b/0x170
[ 6559.054819] xfs_log_force+0x77/0x230
[ 6559.057455] xlog_force_shutdown+0x3b/0x170
[ 6559.060507] xfs_do_force_shutdown+0xd4/0x200
[ 6559.063798] ? xfs_buf_rele+0x1bd/0x580
[ 6559.066541] xfs_buf_ioend_handle_error+0x163/0x2e0
[ 6559.070099] xfs_buf_ioend+0x61/0x200
[ 6559.072728] xfs_buf_ioend_work+0x15/0x20
[ 6559.075706] process_scheduled_works+0x1d4/0x400
[ 6559.078814] worker_thread+0x234/0x2e0
[ 6559.081300] kthread+0x147/0x170
[ 6559.083462] ? __pfx_worker_thread+0x10/0x10
[ 6559.086295] ? __pfx_kthread+0x10/0x10
[ 6559.088771] ret_from_fork+0x3e/0x50
[ 6559.091153] ? __pfx_kthread+0x10/0x10
[ 6559.093624] ret_from_fork_asm+0x1a/0x30
[ 6559.096227] </TASK>
[ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work
[ 6559.112673] Call Trace:
[ 6559.114333] <TASK>
[ 6559.115760] __schedule+0x650/0xb10
[ 6559.118084] schedule+0x6d/0xf0
[ 6559.120175] schedule_timeout+0x31/0x180
[ 6559.122776] ? call_rcu+0xee/0x2f0
[ 6559.125034] __down_common+0xbe/0x1f0
[ 6559.127470] __down+0x1d/0x30
[ 6559.129458] down+0x48/0x50
[ 6559.131343] ? xfs_buf_item_unpin+0x8d/0x380
[ 6559.134213] xfs_buf_lock+0x3d/0xe0
[ 6559.136544] xfs_buf_item_unpin+0x8d/0x380
[ 6559.139253] xlog_cil_committed+0x287/0x520
[ 6559.142019] ? sched_clock+0x10/0x30
[ 6559.144384] ? sched_clock_cpu+0x10/0x190
[ 6559.147039] ? psi_group_change+0x48/0x310
[ 6559.149735] ? _raw_spin_unlock+0xe/0x30
[ 6559.152340] ? finish_task_switch+0xbc/0x310
[ 6559.155163] xlog_cil_process_committed+0x6d/0x90
[ 6559.158265] xlog_state_shutdown_callbacks+0x53/0x110
[ 6559.161564] ? xlog_cil_push_work+0xa70/0xaf0
[ 6559.164441] xlog_state_release_iclog+0xba/0x1b0
[ 6559.167483] xlog_cil_push_work+0xa70/0xaf0
[ 6559.170260] process_scheduled_works+0x1d4/0x400
[ 6559.173286] worker_thread+0x234/0x2e0
[ 6559.175779] kthread+0x147/0x170
[ 6559.177933] ? __pfx_worker_thread+0x10/0x10
[ 6559.180748] ? __pfx_kthread+0x10/0x10
[ 6559.183231] ret_from_fork+0x3e/0x50
[ 6559.185601] ? __pfx_kthread+0x10/0x10
[ 6559.188092] ret_from_fork_asm+0x1a/0x30
[ 6559.190692] </TASK>
This is an ABBA deadlock where buffer IO completion is triggering a
forced shutdown with the buffer lock held. It is waiting for the CIL
to flush as part of the log force. The CIL flush is blocked doing
shutdown processing of all it's objects, trying to unpin a buffer
item. That requires taking the buffer lock....
For the CIL to be doing shutdown processing, the log must be marked
with XLOG_IO_ERROR, but that doesn't happen until after the log
force is issued. Hence for xfs_do_force_shutdown() to be forcing
the log on a shut down log, we must have had a racing
xlog_force_shutdown and xfs_force_shutdown like so:
p0 p1 CIL push
<holds buffer lock>
xlog_force_shutdown
xfs_log_force
test_and_set_bit(XLOG_IO_ERROR)
xlog_state_release_iclog()
sees XLOG_IO_ERROR
xlog_state_shutdown_callbacks
....
xfs_buf_item_unpin
xfs_buf_lock
<blocks on buffer p1 holds>
xfs_force_shutdown
xfs_set_shutdown(mp) wins
xlog_force_shutdown
xfs_log_force
<blocks on CIL push>
xfs_set_shutdown(mp) fails
<shuts down rest of log>
The deadlock can be mitigated by avoiding the log force on the
second pass through xlog_force_shutdown. Do this by adding another
atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to
xlog_force_shutdown() but doesn't mark the log as shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
I've been seeing this failure on during xfs/050 recently:
XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435
....
Call Trace:
<TASK>
xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0
xfs_qm_scall_getquota_next+0x69/0xa0
xfs_fs_get_nextdqblk+0x62/0xf0
quota_getnextxquota+0xbf/0x320
do_quotactl+0x1a1/0x410
__se_sys_quotactl+0x126/0x310
__x64_sys_quotactl+0x21/0x30
x64_sys_call+0x2819/0x2ee0
do_syscall_64+0x68/0x130
entry_SYSCALL_64_after_hwframe+0x76/0x7e
It turns out that the _qmount call has silently been failing to
unmount and mount the filesystem, so when the softlimit is pushed
past with a buffered write, it is not getting synced to disk before
the next quota report is being run.
Hence when the quota report runs, we have 300 blocks of delalloc
data on an inode, with a soft limit of 200 blocks. XFS dquots
account delalloc reservations as used space, hence the dquot is over
the soft limit.
However, we don't update the soft limit timers until we do a
transactional update of the dquot. That is, the dquot sits over the
soft limit without a softlimit timer being started until writeback
occurs and the allocation modifies the dquot and we call
xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in
xfs_trans_commit() context.
This isn't really a problem, except for this debug code in
xfs_qm_scall_getquota_fill_qc():
if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
if ((dst->d_space > dst->d_spc_softlimit) &&
(dst->d_spc_softlimit > 0)) {
ASSERT(dst->d_spc_timer != 0);
}
....
It asserts taht if the used block count is over the soft limit,
it *must* have a soft limit timer running. This is clearly not
the case, because we haven't committed the delalloc space to disk
yet. Hence the soft limit is only exceeded temporarily in memory
(which isn't an issue) and we start the timer the moment we exceed
the soft limit in journalled metadata.
This debug was introduced in:
commit 0d5ad8383061fbc0a9804fbb98218750000fe032
Author: Supriya Wickrematillake <sup@sgi.com>
Date: Wed May 15 22:44:44 1996 +0000
initial checkin
quotactl syscall functions.
The very first quota support commit back in 1996. This is zero-day
debug for Irix and, as it turns out, a zero-day bug in the debug
code because the delalloc code on Irix didn't update the softlimit
timers, either.
IOWs, this issue has been in the code for 28 years.
We obviously don't care if soft limit timers are a bit rubbery when
we have delalloc reservations in memory. Production systems running
quota reports have been exposed to this situation for 28 years and
nobody has noticed it, so the debug code is essentially worthless at
this point in time.
We also have the on-disk dquot verifiers checking that the soft
limit timer is running whenever the dquot is over the soft limit
before we write it to disk and after we read it from disk. These
aren't firing, so it is clear the issue is purely a temporary
in-memory incoherency that I never would have noticed had the test
not silently failed to unmount the filesystem.
Hence I'm simply going to trash this runtime debug because it isn't
useful in the slightest for catching quota bugs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The runt AG at the end of a filesystem is almost always smaller than
the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
limit for the inode chunk allocation, we do not take this into
account. This means we can allocate a sparse inode chunk that
overlaps beyond the end of an AG. When we go to allocate an inode
from that sparse chunk, the irec fails validation because the
agbno of the start of the irec is beyond valid limits for the runt
AG.
Prevent this from happening by taking into account the size of the
runt AG when allocating inode chunks. Also convert the various
checks for valid inode chunk agbnos to use xfs_ag_block_count()
so that they will also catch such issues in the future.
Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Compat features are new features that older kernels can safely ignore,
allowing read-write mounts without issues. The current sb write validation
implementation returns -EFSCORRUPTED for unknown compat features,
preventing filesystem write operations and contradicting the feature's
definition.
Additionally, if the mounted image is unclean, the log recovery may need
to write to the superblock. Returning an error for unknown compat features
during sb write validation can cause mount failures.
Although XFS currently does not use compat feature flags, this issue
affects current kernels' ability to mount images that may use compat
feature flags in the future.
Since superblock read validation already warns about unknown compat
features, it's unnecessary to repeat this warning during write validation.
Therefore, the relevant code in write validation is being removed.
Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xfs_attr_shortform_list() only called from a non-transactional context, it
hold ilock before alloc memory and maybe trapped in memory reclaim. Since
commit 204fae32d5f7("xfs: clean up remaining GFP_NOFS users") removed
GFP_NOFS flag, lockdep warning will be report as [1]. Eliminate lockdep
false positives by use __GFP_NOLOCKDEP to alloc memory
in xfs_attr_shortform_list().
[1] https://lore.kernel.org/linux-xfs/000000000000e33add0616358204@google.com/
Reported-by: syzbot+4248e91deb3db78358a2@syzkaller.appspotmail.com
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|