Age | Commit message (Collapse) | Author |
|
Provide a better description of why the GLF_DEMOTE_IN_PROGRESS flag
cannot be set.
Function do_xmote() may block, so make sure it isn't called when
nonblock is true.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Transform the code in run_queue() to make it more readable. No change
in functionality.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
While not immediately obvious, do_promote() returns whether or not there
are any active holders in the queue. But the function description is
confusing, and this information is easy to come by for callers anyway,
so turn do_promote() into a void function.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Get rid of the GLF_INVALIDATE_IN_PROGRESS flag: it was originally used
to indicate to add_to_queue() that the ->go_sync() and ->go_invalid()
operations were in progress, but as we have established in commit "gfs2:
Fix LM_FLAG_TRY* logic in add_to_queue", add_to_queue() has no need to
know.
Commit d99724c3c36a describes a race in which GLF_INVALIDATE_IN_PROGRESS
is used to serialize two processes which are both in do_xmote() at the
same time. That analysis is wrong: the serialization happens via the
GLF_LOCK flag, which ensures that at most one glock operation can be
active at any time.
Fixes: d99724c3c36a ("gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Commit 865cc3e9cc0b ("gfs2: fix a deadlock on withdraw-during-mount")
added a statement to do_xmote() to clear the GLF_INVALIDATE_IN_PROGRESS
flag a second time after it has already been cleared. Fix that.
Fixes: 865cc3e9cc0b ("gfs2: fix a deadlock on withdraw-during-mount")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
In do_xmote(), remove the duplicate check for the ->go_sync and
->go_inval glock operations. They are either both defined, or none of
them are.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
The logic in add_to_queue() for determining whether a LM_FLAG_TRY or
LM_FLAG_TRY_1CB holder should be queued does not make any sense: we are
interested in wether or not the new operation will block behind an
existing or future holder in the queue, but the current code checks for
ongoing locking or ->go_inval() operations, which has little to do with
that.
Replace that code with something more sensible, remove the incorrect
add_to_queue() function annotations, remove the similarly misguided
do_error(gl, 0) call in do_xmote(), and add a missing comment to the
same call in do_promote().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Commit 6802e3400ff45 ("[GFS2] Clean up the glock core") stopped passing
the LM_FLAG_ANY flag down to gdlm_lock() (then gfs2_lm_lock()). Since
then, gfs2 effectively hasn't been using dlm's DLM_LKF_ALTCW /
DLM_LKF_ALTPR flags, but the code still suggests that it does. Recent
testing shows that those flags don't even work reliably anymore, so
instead of fixing code that hasn't been used since 2008, remove it.
In addition, clean up how the flags are passed to [gd]lm_lock().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
The gl_req field and GLF_BLOCKING flag are only relevant to gdlm_lock(),
its callback gdlm_ast(), and their helpers, so set and clear them inside
lock_dlm.c.
Also, the LM_FLAG_ANY flag is relevant to gdlm_lock(), but do_xmote()
doesn't pass that flag down to gdlm_lock() as it should. Fix that by
passing down all the flags.
In addition, document the effect of the LM_FLAG_ANY flag on locks held
in EX mode locally.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
The GLF_DEMOTE_IN_PROGRESS and GLF_LOCK flags and the glock refcount are
all protected by the glock spin lock, so there is no need for atomic
operations / barriers here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Change those functions to either return a useful value, or nothing at
all.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Since commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and
wait for it to finish"), gfs2_withdraw() always returns -1, so turn it
into a straight void function.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
When the lm_lock hook which calls dlm_lock() returns an error,
do_xmote() previously reported the error to the syslog ("lm_lock ret
%d\n") but otherwise ignored it during withdraws. Commit 9947a06d29c0
("gfs2: do_xmote fixes") changed that to pass the error on to the glock
layer, but the error would then only result in an unconditional BUG() in
finish_xmote(), which doesn't help. Instead, revert to the previous
behavior.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
In do_xmote(), take the additional glock references close to where those
references are needed. This will simplify the next commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Check for asynchronous completion and clear the GLF_PENDING_REPLY flag
earlier in do_xmote(). This will make future changes more readable.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
There is an extraneous space before a newline in a fs_err message.
Remove it
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
This field was introduced in commit 601ef0d52e96 ("gfs2: Force withdraw
to replay journals and wait for it to finish") and never used.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Since commit 38552ff676f0 ("gfs2: Fix and clean up create / evict
interaction"), the GIF_FREE_VFS_INODE flag isn't used anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
|
|
Currently, xattr name prefixes are forcibly placed into the packed
inode if the fragments feature is enabled, and users have no option
to put them in plain form directly on disk.
This is inflexible. First, as mentioned above, users should be able
to store unwrapped long xattr name prefixes unconditionally
(COMPAT_PLAIN_XATTR_PFX). Second, since we now have the new metabox
inode to store metadata, it should be used when available instead
of the packed inode.
Fixes: 414091322c63 ("erofs: implement metadata compression")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix delayed inode tracking in xarray, eviction can race with
insertion and leave behind a disconnected inode
- on systems with large page (64K) and small block size (4K) fix
compression read that can return partially filled folio
- slightly relax compression option format for backward compatibility,
allow to specify level for LZO although there's only one
- fix simple quota accounting of compressed extents
- validate minimum device size in 'device add'
- update maintainers' entry
* tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't allow adding block device of less than 1 MB
MAINTAINERS: update btrfs entry
btrfs: fix subvolume deletion lockup caused by inodes xarray race
btrfs: fix corruption reading compressed range when block size is smaller than page size
btrfs: accept and ignore compression level for lzo
btrfs: fix squota compressed stats leak
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 15 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 14 of these
fixes are for MM.
This includes
- kexec fixes from Breno for a recently introduced
use-uninitialized bug
- DAMON fixes from Quanmin Yan to avoid div-by-zero crashes
which can occur if the operator uses poorly-chosen insmod
parameters
and misc singleton fixes"
* tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add tree entry to numa memblocks and emulation block
mm/damon/sysfs: fix use-after-free in state_show()
proc: fix type confusion in pde_set_flags()
compiler-clang.h: define __SANITIZE_*__ macros only when undefined
mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
ocfs2: fix recursive semaphore deadlock in fiemap call
mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
mm/mremap: fix regression in vrm->new_addr check
percpu: fix race on alloc failed warning limit
mm/memory-failure: fix redundant updates for already poisoned pages
s390: kexec: initialize kexec_buf struct
riscv: kexec: initialize kexec_buf struct
arm64: kexec: initialize kexec_buf struct in load_other_segments()
mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
mm/damon/core: set quota->charged_from to jiffies at first charge window
mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range()
init/main.c: fix boot time tracing crash
mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()
mm/khugepaged: fix the address passed to notifier on testing young
|
|
Pull NFS client fixes from Trond Myklebust:
"Stable patches:
- Revert "SUNRPC: Don't allow waiting for exiting tasks" as it is
breaking ltp tests
Bugfixes:
- Another set of fixes to the tracking of NFSv4 server capabilities
when crossing filesystem boundaries
- Localio fix to restore credentials and prevent triggering a
BUG_ON()
- Fix to prevent flapping of the localio on/off trigger
- Protections against 'eof page pollution' as demonstrated in
xfstests generic/363
- Series of patches to ensure correct ordering of O_DIRECT i/o and
truncate, fallocate and copy functions
- Fix a NULL pointer check in flexfiles reads that regresses 6.17
- Correct a typo that breaks flexfiles layout segment processing"
* tag 'nfs-for-6.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4/flexfiles: Fix layout merge mirror check.
SUNRPC: call xs_sock_process_cmsg for all cmsg
Revert "SUNRPC: Don't allow waiting for exiting tasks"
NFS: Fix the marking of the folio as up to date
NFS: nfs_invalidate_folio() must observe the offset and size arguments
NFSv4.2: Serialise O_DIRECT i/o and copy range
NFSv4.2: Serialise O_DIRECT i/o and clone range
NFSv4.2: Serialise O_DIRECT i/o and fallocate()
NFS: Serialise O_DIRECT i/o and truncate()
NFSv4.2: Protect copy offload and clone against 'eof page pollution'
NFS: Protect against 'eof page pollution'
flexfiles/pNFS: fix NULL checks on result of ff_layout_choose_ds_for_read
nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
nfs/localio: restore creds before releasing pageio data
NFSv4: Clear the NFS_CAP_XATTR flag if not supported by the server
NFSv4: Clear NFS_CAP_OPEN_XOR and NFS_CAP_DELEGTIME if not supported
NFSv4: Clear the NFS_CAP_FS_LOCATIONS flag if it is not set
NFSv4: Don't clear capabilities that won't be reset
|
|
Commit 0e2f80afcfa6("fs/dax: ensure all pages are idle prior to
filesystem unmount") introduced the WARN_ON_ONCE to capture whether
the filesystem has removed all DAX entries or not and applied the
fix to xfs and ext4.
Apply the missed fix on erofs to fix the runtime warning:
[ 5.266254] ------------[ cut here ]------------
[ 5.266274] WARNING: CPU: 6 PID: 3109 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xff/0x260
[ 5.266294] Modules linked in:
[ 5.266999] CPU: 6 UID: 0 PID: 3109 Comm: umount Tainted: G S 6.16.0+ #6 PREEMPT(voluntary)
[ 5.267012] Tainted: [S]=CPU_OUT_OF_SPEC
[ 5.267017] Hardware name: Dell Inc. OptiPlex 5000/05WXFV, BIOS 1.5.1 08/24/2022
[ 5.267024] RIP: 0010:truncate_folio_batch_exceptionals+0xff/0x260
[ 5.267076] Code: 00 00 41 39 df 7f 11 eb 78 83 c3 01 49 83 c4 08 41 39 df 74 6c 48 63 f3 48 83 fe 1f 0f 83 3c 01 00 00 43 f6 44 26 08 01 74 df <0f> 0b 4a 8b 34 22 4c 89 ef 48 89 55 90 e8 ff 54 1f 00 48 8b 55 90
[ 5.267083] RSP: 0018:ffffc900013f36c8 EFLAGS: 00010202
[ 5.267095] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 5.267101] RDX: ffffc900013f3790 RSI: 0000000000000000 RDI: ffff8882a1407898
[ 5.267108] RBP: ffffc900013f3740 R08: 0000000000000000 R09: 0000000000000000
[ 5.267113] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5.267119] R13: ffff8882a1407ab8 R14: ffffc900013f3888 R15: 0000000000000001
[ 5.267125] FS: 00007aaa8b437800(0000) GS:ffff88850025b000(0000) knlGS:0000000000000000
[ 5.267132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.267138] CR2: 00007aaa8b3aac10 CR3: 000000024f764000 CR4: 0000000000f52ef0
[ 5.267144] PKRU: 55555554
[ 5.267150] Call Trace:
[ 5.267154] <TASK>
[ 5.267181] truncate_inode_pages_range+0x118/0x5e0
[ 5.267193] ? save_trace+0x54/0x390
[ 5.267296] truncate_inode_pages_final+0x43/0x60
[ 5.267309] evict+0x2a4/0x2c0
[ 5.267339] dispose_list+0x39/0x80
[ 5.267352] evict_inodes+0x150/0x1b0
[ 5.267376] generic_shutdown_super+0x41/0x180
[ 5.267390] kill_block_super+0x1b/0x50
[ 5.267402] erofs_kill_sb+0x81/0x90 [erofs]
[ 5.267436] deactivate_locked_super+0x32/0xb0
[ 5.267450] deactivate_super+0x46/0x60
[ 5.267460] cleanup_mnt+0xc3/0x170
[ 5.267475] __cleanup_mnt+0x12/0x20
[ 5.267485] task_work_run+0x5d/0xb0
[ 5.267499] exit_to_user_mode_loop+0x144/0x170
[ 5.267512] do_syscall_64+0x2b9/0x7c0
[ 5.267523] ? __lock_acquire+0x665/0x2ce0
[ 5.267535] ? __lock_acquire+0x665/0x2ce0
[ 5.267560] ? lock_acquire+0xcd/0x300
[ 5.267573] ? find_held_lock+0x31/0x90
[ 5.267582] ? mntput_no_expire+0x97/0x4e0
[ 5.267606] ? mntput_no_expire+0xa1/0x4e0
[ 5.267625] ? mntput+0x24/0x50
[ 5.267634] ? path_put+0x1e/0x30
[ 5.267647] ? do_faccessat+0x120/0x2f0
[ 5.267677] ? do_syscall_64+0x1a2/0x7c0
[ 5.267686] ? from_kgid_munged+0x17/0x30
[ 5.267703] ? from_kuid_munged+0x13/0x30
[ 5.267711] ? __do_sys_getuid+0x3d/0x50
[ 5.267724] ? do_syscall_64+0x1a2/0x7c0
[ 5.267732] ? irqentry_exit+0x77/0xb0
[ 5.267743] ? clear_bhb_loop+0x30/0x80
[ 5.267752] ? clear_bhb_loop+0x30/0x80
[ 5.267765] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 5.267772] RIP: 0033:0x7aaa8b32a9fb
[ 5.267781] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e9 83 0d 00 f7 d8
[ 5.267787] RSP: 002b:00007ffd7c4c9468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 5.267796] RAX: 0000000000000000 RBX: 00005a61592a8b00 RCX: 00007aaa8b32a9fb
[ 5.267802] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005a61592b2080
[ 5.267806] RBP: 00007ffd7c4c9540 R08: 00007aaa8b403b20 R09: 0000000000000020
[ 5.267812] R10: 0000000000000001 R11: 0000000000000246 R12: 00005a61592a8c00
[ 5.267817] R13: 0000000000000000 R14: 00005a61592b2080 R15: 00005a61592a8f10
[ 5.267849] </TASK>
[ 5.267854] irq event stamp: 4721
[ 5.267859] hardirqs last enabled at (4727): [<ffffffff814abf50>] __up_console_sem+0x90/0xa0
[ 5.267873] hardirqs last disabled at (4732): [<ffffffff814abf35>] __up_console_sem+0x75/0xa0
[ 5.267884] softirqs last enabled at (3044): [<ffffffff8132adb3>] kernel_fpu_end+0x53/0x70
[ 5.267895] softirqs last disabled at (3042): [<ffffffff8132b5f4>] kernel_fpu_begin_mask+0xc4/0x120
[ 5.267905] ---[ end trace 0000000000000000 ]---
Fixes: bde708f1a65d ("fs/dax: always remove DAX page-cache entries when breaking layouts")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Friendy Su <friendy.su@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Rename of open files in SMB2+ has been broken for a very long time,
resulting in data loss as the CIFS client would fail the rename(2)
call with -ENOENT and then removing the target file.
Fix this by implementing ->rename_pending_delete() for SMB2+, which
will rename busy files to random filenames (e.g. silly rename) during
unlink(2) or rename(2), and then marking them to delete-on-close.
Besides, introduce a FIND_WR_NO_PENDING_DELETE flag to prevent open(2)
from reusing open handles that had been marked as delete pending.
Handle it in cifs_get_readable_path() as well.
Reported-by: Jean-Baptiste Denis <jbdenis@pasteur.fr>
Closes: https://marc.info/?i=16aeb380-30d4-4551-9134-4e7d1dc833c0@pasteur.fr
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: Frank Sorenson <sorenson@redhat.com>
Cc: Olga Kornievskaia <okorniev@redhat.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Scott Mayhew <smayhew@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The encryption layer can't handle the padding iovs, so flatten the
compound request into a single buffer with required padding to prevent
the server from dropping the connection when finding unaligned
compound requests.
Fixes: bc925c1216f0 ("smb: client: improve compound padding in encryption")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
BUG: KASAN: slab-out-of-bounds in hfsplus_uni2asc+0xa71/0xb90 fs/hfsplus/unicode.c:186
Read of size 2 at addr ffff8880289ef218 by task syz.6.248/14290
CPU: 0 UID: 0 PID: 14290 Comm: syz.6.248 Not tainted 6.16.4 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x5f0 mm/kasan/report.c:482
kasan_report+0xca/0x100 mm/kasan/report.c:595
hfsplus_uni2asc+0xa71/0xb90 fs/hfsplus/unicode.c:186
hfsplus_listxattr+0x5b6/0xbd0 fs/hfsplus/xattr.c:738
vfs_listxattr+0xbe/0x140 fs/xattr.c:493
listxattr+0xee/0x190 fs/xattr.c:924
filename_listxattr fs/xattr.c:958 [inline]
path_listxattrat+0x143/0x360 fs/xattr.c:988
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcb/0x4c0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe0e9fae16d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe0eae67f98 EFLAGS: 00000246 ORIG_RAX: 00000000000000c3
RAX: ffffffffffffffda RBX: 00007fe0ea205fa0 RCX: 00007fe0e9fae16d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000200000000000
RBP: 00007fe0ea0480f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fe0ea206038 R14: 00007fe0ea205fa0 R15: 00007fe0eae48000
</TASK>
Allocated by task 14290:
kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
kasan_save_track+0x14/0x30 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__do_kmalloc_node mm/slub.c:4333 [inline]
__kmalloc_noprof+0x219/0x540 mm/slub.c:4345
kmalloc_noprof include/linux/slab.h:909 [inline]
hfsplus_find_init+0x95/0x1f0 fs/hfsplus/bfind.c:21
hfsplus_listxattr+0x331/0xbd0 fs/hfsplus/xattr.c:697
vfs_listxattr+0xbe/0x140 fs/xattr.c:493
listxattr+0xee/0x190 fs/xattr.c:924
filename_listxattr fs/xattr.c:958 [inline]
path_listxattrat+0x143/0x360 fs/xattr.c:988
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcb/0x4c0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
When hfsplus_uni2asc is called from hfsplus_listxattr,
it actually passes in a struct hfsplus_attr_unistr*.
The size of the corresponding structure is different from that of hfsplus_unistr,
so the previous fix (94458781aee6) is insufficient.
The pointer on the unicode buffer is still going beyond the allocated memory.
This patch introduces two warpper functions hfsplus_uni2asc_xattr_str and
hfsplus_uni2asc_str to process two unicode buffers,
struct hfsplus_attr_unistr* and struct hfsplus_unistr* respectively.
When ustrlen value is bigger than the allocated memory size,
the ustrlen value is limited to an safe size.
Fixes: 94458781aee6 ("hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()")
Signed-off-by: Kang Chen <k.chen@smail.nju.edu.cn>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250909031316.1647094-1-k.chen@smail.nju.edu.cn
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
|
|
The function move_dirty_folio_in_page_array() was created by commit
ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") by
moving code from ceph_writepages_start() to this function.
This new function is supposed to return an error code which is checked
by the caller (now ceph_process_folio_batch()), and on error, the
caller invokes redirty_page_for_writepage() and then breaks from the
loop.
However, the refactoring commit has gone wrong, and it by accident, it
always returns 0 (= success) because it first NULLs the pointer and
then returns PTR_ERR(NULL) which is always 0. This means errors are
silently ignored, leaving NULL entries in the page array, which may
later crash the kernel.
The simple solution is to call PTR_ERR() before clearing the pointer.
Cc: stable@vger.kernel.org
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/aK4v548CId5GIKG1@swift.blarg.de/
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state. Before folio_batch_release() crashes
due to this API violation, the function ceph_shift_unused_folios_left()
is supposed to remove those NULLs from the array.
However, since commit ce80b76dd327 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to ceph_process_folio_batch(),
and now the `i` variable that remains in ceph_writepages_start()
doesn't get incremented anymore, making the shifting effectively
unreachable much of the time.
Later, commit 1551ec61dc55 ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):
- if ceph_process_folio_batch() fails, shifting never happens
- if ceph_move_dirty_page_in_page_array() was never called (because
ceph_process_folio_batch() has returned early for some of various
reasons), shifting never happens
- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
has returned early for some of the reasons mentioned above or
because ceph_move_dirty_page_in_page_array() has failed), shifting
never happens
Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:
BUG: kernel NULL pointer dereference, address: 0000000000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE
Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
Workqueue: writeback wb_workfn (flush-ceph-1)
RIP: 0010:folios_put_refs+0x85/0x140
Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
FS: 0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
ceph_writepages_start+0xeb9/0x1410
The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.
(Interestingly, the crash happens only if `huge_zero_folio` has
already been allocated; without `huge_zero_folio`,
is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
entries instead of dereferencing them. That makes reproducing the bug
somewhat unreliable. See
https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com
for a discussion of this detail.)
My suggestion is to move the ceph_shift_unused_folios_left() to right
after ceph_process_folio_batch() to ensure it always gets called to
fix up the illegal folio_batch state.
Cc: stable@vger.kernel.org
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/aK4v548CId5GIKG1@swift.blarg.de/
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
When the parent directory's i_rwsem is not locked, req->r_parent may become
stale due to concurrent operations (e.g. rename) between dentry lookup and
message creation. Validate that r_parent matches the encoded parent inode
and update to the correct inode if a mismatch is detected.
[ idryomov: folded a follow-up fix from Alex to drop extra reference
from ceph_get_reply_dir() in ceph_fill_trace():
ceph_get_reply_dir() may return a different, referenced inode when
r_parent is stale and the parent directory lock is not held.
ceph_fill_trace() used that inode but failed to drop the reference
when it differed from req->r_parent, leaking an inode reference.
Keep the directory inode in a local variable and iput() it at
function end if it does not match req->r_parent. ]
Cc: stable@vger.kernel.org
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Add validation to ensure the cached parent directory inode matches the
directory info in MDS replies. This prevents client-side race conditions
where concurrent operations (e.g. rename) cause r_parent to become stale
between request initiation and reply processing, which could lead to
applying state changes to incorrect directory inodes.
[ idryomov: folded a kerneldoc fixup and a follow-up fix from Alex to
move CEPH_CAP_PIN reference when r_parent is updated:
When the parent directory lock is not held, req->r_parent can become
stale and is updated to point to the correct inode. However, the
associated CEPH_CAP_PIN reference was not being adjusted. The
CEPH_CAP_PIN is a reference on an inode that is tracked for
accounting purposes. Moving this pin is important to keep the
accounting balanced. When the pin was not moved from the old parent
to the new one, it created two problems: The reference on the old,
stale parent was never released, causing a reference leak.
A reference for the new parent was never acquired, creating the risk
of a reference underflow later in ceph_mdsc_release_request(). This
patch corrects the logic by releasing the pin from the old parent and
acquiring it for the new parent when r_parent is switched. This
ensures reference accounting stays balanced. ]
Cc: stable@vger.kernel.org
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Running resctrl_tests on an SNC-2 system with lockdep debugging enabled
triggers several warnings with following trace:
WARNING: CPU: 0 PID: 1914 at kernel/cpu.c:528 lockdep_assert_cpus_held
...
Call Trace:
__mon_event_count
? __lock_acquire
? __pfx___mon_event_count
mon_event_count
? __pfx_smp_mon_event_count
smp_mon_event_count
smp_call_on_cpu_callback
get_cpu_cacheinfo_level() called from __mon_event_count() requires CPU hotplug
lock to be held. The hotplug lock is indeed held during this time, as
confirmed by the lockdep_assert_cpus_held() within mon_event_read() that calls
mon_event_count() via IPI, but the lockdep tracking is not able to follow the
IPI.
Fresh CPU cache information via get_cpu_cacheinfo_level() from
__mon_event_count() was added to support the fix for the issue where resctrl
inappropriately maintained links to L3 cache information that will be stale in
the case when the associated CPU goes offline.
Keep the cacheinfo ID in struct rdt_mon_domain to ensure that resctrl does not
maintain stale cache information while CPUs can go offline. Return to using
a pointer to the L3 cache information (struct cacheinfo) in struct rmid_read,
rmid_read::ci. Initialize rmid_read::ci before the IPI where it is used. CPU
hotplug lock is held across rmid_read::ci initialization and use to ensure
that it points to accurate cache information.
Fixes: 594902c986e2 ("x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Commit 2ce3d282bd50 ("proc: fix missing pde_set_flags() for net proc
files") missed a key part in the definition of proc_dir_entry:
union {
const struct proc_ops *proc_ops;
const struct file_operations *proc_dir_ops;
};
So dereference of ->proc_ops assumes it is a proc_ops structure results in
type confusion and make NULL check for 'proc_ops' not work for proc dir.
Add !S_ISDIR(dp->mode) test before calling pde_set_flags() to fix it.
Link: https://lkml.kernel.org/r/20250904135715.3972782-1-wangzijie1@honor.com
Fixes: 2ce3d282bd50 ("proc: fix missing pde_set_flags() for net proc files")
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reported-by: Brad Spengler <spender@grsecurity.net>
Closes: https://lore.kernel.org/all/20250903065758.3678537-1-wangzijie1@honor.com/
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Stefano Brivio <sbrivio@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzbot detected a OCFS2 hang due to a recursive semaphore on a
FS_IOC_FIEMAP of the extent list on a specially crafted mmap file.
context_switch kernel/sched/core.c:5357 [inline]
__schedule+0x1798/0x4cc0 kernel/sched/core.c:6961
__schedule_loop kernel/sched/core.c:7043 [inline]
schedule+0x165/0x360 kernel/sched/core.c:7058
schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7115
rwsem_down_write_slowpath+0x872/0xfe0 kernel/locking/rwsem.c:1185
__down_write_common kernel/locking/rwsem.c:1317 [inline]
__down_write kernel/locking/rwsem.c:1326 [inline]
down_write+0x1ab/0x1f0 kernel/locking/rwsem.c:1591
ocfs2_page_mkwrite+0x2ff/0xc40 fs/ocfs2/mmap.c:142
do_page_mkwrite+0x14d/0x310 mm/memory.c:3361
wp_page_shared mm/memory.c:3762 [inline]
do_wp_page+0x268d/0x5800 mm/memory.c:3981
handle_pte_fault mm/memory.c:6068 [inline]
__handle_mm_fault+0x1033/0x5440 mm/memory.c:6195
handle_mm_fault+0x40a/0x8e0 mm/memory.c:6364
do_user_addr_fault+0x764/0x1390 arch/x86/mm/fault.c:1387
handle_page_fault arch/x86/mm/fault.c:1476 [inline]
exc_page_fault+0x76/0xf0 arch/x86/mm/fault.c:1532
asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:623
RIP: 0010:copy_user_generic arch/x86/include/asm/uaccess_64.h:126 [inline]
RIP: 0010:raw_copy_to_user arch/x86/include/asm/uaccess_64.h:147 [inline]
RIP: 0010:_inline_copy_to_user include/linux/uaccess.h:197 [inline]
RIP: 0010:_copy_to_user+0x85/0xb0 lib/usercopy.c:26
Code: e8 00 bc f7 fc 4d 39 fc 72 3d 4d 39 ec 77 38 e8 91 b9 f7 fc 4c 89
f7 89 de e8 47 25 5b fd 0f 01 cb 4c 89 ff 48 89 d9 4c 89 f6 <f3> a4 0f
1f 00 48 89 cb 0f 01 ca 48 89 d8 5b 41 5c 41 5d 41 5e 41
RSP: 0018:ffffc9000403f950 EFLAGS: 00050256
RAX: ffffffff84c7f101 RBX: 0000000000000038 RCX: 0000000000000038
RDX: 0000000000000000 RSI: ffffc9000403f9e0 RDI: 0000200000000060
RBP: ffffc9000403fa90 R08: ffffc9000403fa17 R09: 1ffff92000807f42
R10: dffffc0000000000 R11: fffff52000807f43 R12: 0000200000000098
R13: 00007ffffffff000 R14: ffffc9000403f9e0 R15: 0000200000000060
copy_to_user include/linux/uaccess.h:225 [inline]
fiemap_fill_next_extent+0x1c0/0x390 fs/ioctl.c:145
ocfs2_fiemap+0x888/0xc90 fs/ocfs2/extent_map.c:806
ioctl_fiemap fs/ioctl.c:220 [inline]
do_vfs_ioctl+0x1173/0x1430 fs/ioctl.c:532
__do_sys_ioctl fs/ioctl.c:596 [inline]
__se_sys_ioctl+0x82/0x170 fs/ioctl.c:584
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5f13850fd9
RSP: 002b:00007ffe3b3518b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000200000000000 RCX: 00007f5f13850fd9
RDX: 0000200000000040 RSI: 00000000c020660b RDI: 0000000000000004
RBP: 6165627472616568 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe3b3518f0
R13: 00007ffe3b351b18 R14: 431bde82d7b634db R15: 00007f5f1389a03b
ocfs2_fiemap() takes a read lock of the ip_alloc_sem semaphore (since
v2.6.22-527-g7307de80510a) and calls fiemap_fill_next_extent() to read the
extent list of this running mmap executable. The user supplied buffer to
hold the fiemap information page faults calling ocfs2_page_mkwrite() which
will take a write lock (since v2.6.27-38-g00dc417fa3e7) of the same
semaphore. This recursive semaphore will hold filesystem locks and causes
a hang of the fileystem.
The ip_alloc_sem protects the inode extent list and size. Release the
read semphore before calling fiemap_fill_next_extent() in ocfs2_fiemap()
and ocfs2_fiemap_inline(). This does an unnecessary semaphore lock/unlock
on the last extent but simplifies the error path.
Link: https://lkml.kernel.org/r/61d1a62b-2631-4f12-81e2-cd689914360b@oracle.com
Fixes: 00dc417fa3e7 ("ocfs2: fiemap support")
Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com>
Reported-by: syzbot+541dcc6ee768f77103e7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=541dcc6ee768f77103e7
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Typo in ff_lseg_match_mirrors makes the diff ineffective. This results
in merge happening all the time. Merge happening all the time is
problematic because it marks lsegs invalid. Marking lsegs invalid
causes all outstanding IO to get restarted with EAGAIN and connections
to get closed.
Closing connections constantly triggers race conditions in the RDMA
implementation...
Fixes: 660d1eb22301c ("pNFS/flexfile: Don't merge layout segments if the mirrors don't match")
Signed-off-by: Jonathan Curley <jcurley@purestorage.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"fuse:
- Prevent opening of non-regular backing files.
Fuse doesn't support non-regular files anyway.
- Check whether copy_file_range() returns a larger size than
requested.
- Prevent overflow in copy_file_range() as fuse currently only
supports 32-bit sized copies.
- Cache the blocksize value if the server returned a new value as
inode->i_blkbits isn't modified directly anymore.
- Fix i_blkbits handling for iomap partial writes.
By default i_blkbits is set to PAGE_SIZE which causes iomap to mark
the whole folio as uptodate even on a partial write. But fuseblk
filesystems support choosing a blocksize smaller than PAGE_SIZE
risking data corruption. Simply enforce PAGE_SIZE as blocksize for
fuseblk's internal inode for now.
- Prevent out-of-bounds acces in fuse_dev_write() when the number of
bytes to be retrieved is truncated to the fc->max_pages limit.
virtiofs:
- Fix page faults for DAX page addresses.
Misc:
- Tighten file handle decoding from userns.
Check that the decoded dentry itself has a valid idmapping in the
user namespace.
- Fix mount-notify selftests.
- Fix some indentation errors.
- Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability.
This will be moved to an FOP_* flag with a bit more rework needed
for that to happen not suitable for a fix.
- Don't silently ignore metadata for sync read/write.
- Don't pointlessly log warning when reading coredump sysctls"
* tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: virtio_fs: fix page fault for DAX page address
selftests/fs/mount-notify: Fix compilation failure.
fhandle: use more consistent rules for decoding file handle from userns
fuse: Block access to folio overlimit
fuse: fix fuseblk i_blkbits for iomap partial writes
fuse: reflect cached blocksize if blocksize was changed
fuse: prevent overflow in copy_file_range return value
fuse: check if copy_file_range() returns larger than requested size
fuse: do not allow mapping a non-regular backing file
coredump: don't pointlessly check and spew warnings
fs: fix indentation style
block: don't silently ignore metadata for sync read/write
fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability
Please enter a commit message to explain why this merge is necessary,
especially if it merges an updated upstream into a topic branch.
|
|
Since all callers of nfs_page_group_covers_page() have already ensured
that there is only one group member, all that is required is to check
that the entire folio contains dirty data.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
If we're truncating part of the folio, then we need to write out the
data on the part that is not covered by the cancellation.
Fixes: d47992f86b30 ("mm: change invalidatepage prototype to accept length")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Ensure that all O_DIRECT reads and writes complete before copying a file
range, so that the destination is up to date.
Fixes: a5864c999de6 ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Ensure that all O_DIRECT reads and writes complete before cloning a file
range, so that both the source and destination are up to date.
Fixes: a5864c999de6 ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Ensure that all O_DIRECT reads and writes complete before calling
fallocate so that we don't race w.r.t. attribute updates.
Fixes: 99f237832243 ("NFSv4.2: Always flush out writes in nfs42_proc_fallocate()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Ensure that all O_DIRECT reads and writes are complete, and prevent the
initiation of new i/o until the setattr operation that will truncate the
file is complete.
Fixes: a5864c999de6 ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The NFSv4.2 copy offload and clone functions can also end up extending
the size of the destination file, so they too need to call
nfs_truncate_last_folio().
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
This commit fixes the failing xfstest 'generic/363'.
When the user mmaps() an area that extends beyond the end of file, and
proceeds to write data into the folio that straddles that eof, we're
required to discard that folio data if the user calls some function that
extends the file length.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Recent commit f06bedfa62d5 ("pNFS/flexfiles: don't attempt pnfs on fatal DS
errors") has changed the error return type of ff_layout_choose_ds_for_read() from
NULL to an error pointer. However, not all code paths have been updated
to match the change. Thus, some non-NULL checks will accept error pointers
as a valid return value.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f06bedfa62d5 ("pNFS/flexfiles: don't attempt pnfs on fatal DS errors")
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Previously nfs_local_probe() was made to disable and then attempt to
re-enable LOCALIO (via LOCALIO protocol handshake) if/when it was
called and LOCALIO already enabled.
Vague memory for _why_ this was the case is that this was useful
if/when a local NFS server were to be restarted with a local NFS
client connected to it.
But as it happens this causes an absurd amount of LOCALIO flapping
which has a side-effect of too much IO being needlessly sent to NFSD
(using RPC over the loopback network interface). This is the
definition of "serious performance loss" (that negates the point of
having LOCALIO).
So remove this mis-optimization for re-enabling LOCALIO if/when an NFS
server is restarted (which is an extremely rare thing to do). Will
revisit testing that scenario again but in the meantime this patch
restores the full benefit of LOCALIO.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
A use-after-free (UAF) vulnerability was identified in the PSI (Pressure
Stall Information) monitoring mechanism:
BUG: KASAN: slab-use-after-free in psi_trigger_poll+0x3c/0x140
Read of size 8 at addr ffff3de3d50bd308 by task systemd/1
psi_trigger_poll+0x3c/0x140
cgroup_pressure_poll+0x70/0xa0
cgroup_file_poll+0x8c/0x100
kernfs_fop_poll+0x11c/0x1c0
ep_item_poll.isra.0+0x188/0x2c0
Allocated by task 1:
cgroup_file_open+0x88/0x388
kernfs_fop_open+0x73c/0xaf0
do_dentry_open+0x5fc/0x1200
vfs_open+0xa0/0x3f0
do_open+0x7e8/0xd08
path_openat+0x2fc/0x6b0
do_filp_open+0x174/0x368
Freed by task 8462:
cgroup_file_release+0x130/0x1f8
kernfs_drain_open_files+0x17c/0x440
kernfs_drain+0x2dc/0x360
kernfs_show+0x1b8/0x288
cgroup_file_show+0x150/0x268
cgroup_pressure_write+0x1dc/0x340
cgroup_file_write+0x274/0x548
Reproduction Steps:
1. Open test/cpu.pressure and establish epoll monitoring
2. Disable monitoring: echo 0 > test/cgroup.pressure
3. Re-enable monitoring: echo 1 > test/cgroup.pressure
The race condition occurs because:
1. When cgroup.pressure is disabled (echo 0 > cgroup.pressure), it:
- Releases PSI triggers via cgroup_file_release()
- Frees of->priv through kernfs_drain_open_files()
2. While epoll still holds reference to the file and continues polling
3. Re-enabling (echo 1 > cgroup.pressure) accesses freed of->priv
epolling disable/enable cgroup.pressure
fd=open(cpu.pressure)
while(1)
...
epoll_wait
kernfs_fop_poll
kernfs_get_active = true echo 0 > cgroup.pressure
... cgroup_file_show
kernfs_show
// inactive kn
kernfs_drain_open_files
cft->release(of);
kfree(ctx);
...
kernfs_get_active = false
echo 1 > cgroup.pressure
kernfs_show
kernfs_activate_one(kn);
kernfs_fop_poll
kernfs_get_active = true
cgroup_file_poll
psi_trigger_poll
// UAF
...
end: close(fd)
To address this issue, introduce kernfs_get_active_of() for kernfs open
files to obtain active references. This function will fail if the open file
has been released. Replace kernfs_get_active() with kernfs_get_active_of()
to prevent further operations on released file descriptors.
Fixes: 34f26a15611a ("sched/psi: Per-cgroup PSI accounting disable/re-enable interface")
Cc: stable <stable@kernel.org>
Reported-by: Zhang Zhaotian <zhangzhaotian@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250822070715.1565236-2-chenridong@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
For the HKDF-SHA512 key derivation needed by fscrypt, just use the
HMAC-SHA512 library functions directly. These functions were introduced
in v6.17, and they provide simple and efficient direct support for
HMAC-SHA512. This ends up being quite a bit simpler and more efficient
than using crypto/hkdf.c, as it avoids the generic crypto layer:
- The HMAC library can't fail, so callers don't need to handle errors
- No inefficient indirect calls
- No inefficient and error-prone dynamic allocations
- No inefficient and error-prone loading of algorithm by name
- Less stack usage
Benchmarks on x86_64 show that deriving a per-file key gets about 30%
faster, and FS_IOC_ADD_ENCRYPTION_KEY gets nearly twice as fast.
The only small downside is the HKDF-Expand logic gets duplicated again.
Then again, even considering that, the new fscrypt_hkdf_expand() is only
7 lines longer than the version that called hkdf_expand(). Later we
could add HKDF support to lib/crypto/, but for now let's just do this.
Link: https://lore.kernel.org/r/20250906035913.1141532-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Otherwise if the nfsd filecache code releases the nfsd_file
immediately, it can trigger the BUG_ON(cred == current->cred) in
__put_cred() when it puts the nfsd_file->nf_file->f-cred.
Fixes: b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250807164938.2395136-1-smayhew@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.18-merge
xfs: kconfig and feature changes for 2025 LTS [6.18 v2 2/2]
Ahead of the 2025 LTS kernel, disable by default the two features that
we promised to turn off in September 2025: V4 filesystems, and the
long-broken ASCII case insensitive directories.
Since online fsck has not had any major issues in the 16 months since it
was merged upstream, let's also turn that on by default.
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.18-merge
xfs: improve online repair reap calculations [6.18 v2 1/2]
A few months ago, the multi-fsblock untorn writes patchset added a bunch
of log intent item helper functions to estimate the number of intent
items that could be added to a particular transaction. Those helpers
enabled us to compute a safe upper bound on the number of blocks that
could be written in an untorn fashion with filesystem-provided out of
place writes.
Currently, the online fsck code employs static limits on the number of
intent items that it's willing to accrue to a single transaction when
it's trying to reap what it thinks are the old blocks from a corrupt
structure. There have been no problems reported with this approach
after years of testing, but static limits are scary and gross because
overestimating the intent item limit could result in transaction
overflows and dead filesystems; and underestimating causes unnecessary
overhead.
This series uses the new log intent item size helpers to estimate the
limits dynamically based on worst-case per-block repair work vs. the
size of the scrub transaction. After several months of testing this,
there don't seem to be any problems here either.
v2: rearrange patches, add review tags
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|