Age | Commit message (Collapse) | Author |
|
refcount_t type and corresponding API can protect refcounters from
accidental underflow and overflow and further use-after-free situations.
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1626674355-55795-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After we drop i_data sem, we need to reload the ext4_ext_path
structure since the extent tree can change once i_data_sem is
released.
This addresses the BUG:
[52117.465187] ------------[ cut here ]------------
[52117.465686] kernel BUG at fs/ext4/extents.c:1756!
...
[52117.478306] Call Trace:
[52117.478565] ext4_ext_shift_extents+0x3ee/0x710
[52117.479020] ext4_fallocate+0x139c/0x1b40
[52117.479405] ? __do_sys_newfstat+0x6b/0x80
[52117.479805] vfs_fallocate+0x151/0x4b0
[52117.480177] ksys_fallocate+0x4a/0xa0
[52117.480533] __x64_sys_fallocate+0x22/0x30
[52117.480930] do_syscall_64+0x35/0x80
[52117.481277] entry_SYSCALL_64_after_hwframe+0x44/0xae
[52117.481769] RIP: 0033:0x7fa062f855ca
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210903062748.4118886-4-yangerkun@huawei.com
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Like ext4_ext_rm_leaf, we can ensure that there are enough credits
before every call that will consume credits. As part of this fix we
fold the functionality of ext4_access_path() into
ext4_ext_shift_path_extents(). This change is needed as a preparation
for the next bugfix patch.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210903062748.4118886-3-yangerkun@huawei.com
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The debuginfo for binsearch want to show the left/middle/right extent
while the process search for the goal block. However we show this info
after we change right or left.
Link: https://lore.kernel.org/r/20210903062748.4118886-2-yangerkun@huawei.com
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
granular unit
Ext4 file system has default lazy inode table initialization setup once
it is mounted. However, it has issue on computing the next schedule time
that makes the timeout same amount in jiffies but different real time in
secs if with various HZ values. Therefore, fix by measuring the current
time in a more granular unit nanoseconds and make the next schedule time
independent of the HZ value.
Fixes: bfff68738f1c ("ext4: add support for lazy inode table initialization")
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210902164412.9994-2-shaoyi@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This reverts commit 948ca5f30e1df0c11eb5b0f410b9ceb97fa77ad9.
Two crash reports from users running variations on 5.15-rc4 kernels
suggest that it is premature to enforce the state assertion in the
original commit. Both crashes were triggered by BUG calls in that
code, indicating that under some rare circumstance the buffer head
state did not match a delayed allocated block at the time the
block was written out. No reproducer is available. Resolving this
problem will require more time than remains in the current release
cycle, so reverting the original patch for the time being is necessary
to avoid any instability it may cause.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20211012171901.5352-1-enwlinux@gmail.com
Fixes: 948ca5f30e1d ("ext4: enforce buffer head state assertion in ext4_da_map_blocks")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
This regression can be reproduced with ntfs-3g and overlayfs:
mkdir lower upper work overlay
dd if=/dev/zero of=ntfs.raw bs=1M count=2
mkntfs -F ntfs.raw
mount ntfs.raw lower
touch lower/file.txt
mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work - overlay
mv overlay/file.txt overlay/file2.txt
mv fails and (misleadingly) prints
mv: cannot move 'overlay/file.txt' to a subdirectory of itself, 'overlay/file2.txt'
The reason is that ovl_copy_fileattr() is triggered due to S_NOATIME being
set on all inodes (by fuse) regardless of fileattr.
ovl_copy_fileattr() tries to retrieve file attributes from lower file, but
that fails because filesystem does not support this ioctl (this should fail
with ENOTTY, but ntfs-3g return EINVAL instead). This failure is
propagated to origial operation (in this case rename) that triggered the
copy-up.
The fix is to ignore ENOTTY and EINVAL errors from fileattr_get() in copy
up. This also requires turning the internal ENOIOCTLCMD into ENOTTY.
As a further measure to prevent unnecessary failures, only try the
fileattr_get/set on upper if there are any flags to copy up.
Side note: a number of filesystems set S_NOATIME (and sometimes other inode
flags) irrespective of fileattr flags. This causes unnecessary calls
during copy up, which might lead to a performance issue, especially if
latency is high. To fix this, the kernel would need to differentiate
between the two cases. E.g. introduce SB_NOATIME_UPDATE, a per-sb variant
of S_NOATIME. SB_NOATIME doesn't work, because that's interpreted as
"filesystem doesn't store an atime attribute"
Reported-and-tested-by: Kevin Locke <kevin@kevinlocke.name>
Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Sohaib Mohamed started a serie of tiny and incomplete checkpatch fixes but
seemingly stopped halfway -- take over and do most of it.
This is still missing net/9p/trans* and net/9p/protocol.c for a later
time...
Link: http://lkml.kernel.org/r/20211102134608.1588018-3-dominique.martinet@atmark-techno.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
having a readahead of 128k with a msize of 128k (with overhead) lead to
reading 124+4k everytime, making two roundtrips needlessly.
tune readahead according to msize when cache is enabled for better
performance
Link: http://lkml.kernel.org/r/20211104120323.2189376-1-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Syzbot triggered the following warning in ovl_workdir_create() ->
ovl_create_real():
if (!err && WARN_ON(!newdentry->d_inode)) {
The reason is that the cgroup2 filesystem returns from mkdir without
instantiating the new dentry.
Weird filesystems such as this will be rejected by overlayfs at a later
stage during setup, but to prevent such a warning, call ovl_mkdir_real()
directly from ovl_workdir_create() and reject this case early.
Reported-and-tested-by: syzbot+75eab84fd0af9e8bf66b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This patch removes a list_first_entry() call which is already done by
the previous con_next_wq() call.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull per signal_struct coredumps from Eric Biederman:
"Current coredumps are mixed up with the exit code, the signal handling
code, and the ptrace code making coredumps much more complicated than
necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up,
making it easier to follow what is happening in ptrace_stop. Then
cleans up the exec interactions with coredumps. Then cleans up the
coredump interactions with exit. Finally the coredump interactions
with the signal handling code is cleaned up.
The first and last changes are bug fixes for minor bugs.
I believe the fact that vfork followed by execve can kill the process
the called vfork if exec fails is sufficient justification to change
the userspace visible behavior.
In previous discussions some of these changes were organized
differently and individually appeared to make the code base worse. As
currently written I believe they all stand on their own as cleanups
and bug fixes.
Which means that even if the worst should happen and the last change
needs to be reverted for some unimaginable reason, the code base will
still be improved.
If the worst does not happen there are a more cleanups that can be
made. Signals that generate coredumps can easily become eligible for
short circuit delivery in complete_signal. The entire rendezvous for
generating a coredump can move into get_signal. The function
force_sig_info_to_task be written in a way that does not modify the
signal handling state of the target task (because coredumps are
eligible for short circuit delivery). Many of these future cleanups
can be done another way but nothing so cleanly as if coredumps become
per signal_struct"
* 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
coredump: Limit coredumps to a single thread group
coredump: Don't perform any cleanups before dumping core
exit: Factor coredump_exit_mm out of exit_mm
exec: Check for a pending fatal signal instead of core_state
ptrace: Remove the unnecessary arguments from arch_ptrace_stop
signal: Remove the bogus sigkill_pending in ptrace_stop
|
|
Pull jfs fix from David Kleikamp:
"Just one JFS patch"
* tag 'jfs-5.16' of git://github.com/kleikamp/linux-shaggy:
JFS: fix memleak in jfs_mount
|
|
Only dereference i->iov after establishing that i is of type ITER_IOVEC.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Allow atomic exchange via RENAME_EXCHANGE when using simple_rename.
This affects binderfs, ramfs, hubetlbfs and bpffs.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20211028094724.59043-3-lmb@cloudflare.com
|
|
Move shmem_exchange and make it available to other callers.
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20211028094724.59043-2-lmb@cloudflare.com
|
|
During umount, the session slot tables are freed. If there are
outstanding FREE_STATEID tasks, a use-after-free and slab corruption can
occur when rpc_exit_task calls rpc_call_done -> nfs41_sequence_done ->
nfs4_sequence_process/nfs41_sequence_free_slot.
Prevent that from happening by taking a reference on the nfs_client in
nfs41_free_stateid and putting it in nfs41_free_stateid_release.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
This is also a checkpatch warning fix but this one might have implications
so keeping it separate
Link: http://lkml.kernel.org/r/20211102134608.1588018-5-dominique.martinet@atmark-techno.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This is also a checkpatch change, but this one might have more implications
so keeping this separate
Link: http://lkml.kernel.org/r/20211102134608.1588018-4-dominique.martinet@atmark-techno.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
- add missing SPDX-License-Identifier
- remove (sometimes incorrect) file name from file header
Link: http://lkml.kernel.org/r/20211102134608.1588018-2-dominique.martinet@atmark-techno.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Warning found by checkpatch.pl
Link: http://lkml.kernel.org/r/20210930220420.44150-1-sohaib.amhmd@gmail.com
Signed-off-by: Sohaib Mohamed <sohaib.amhmd@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Link: http://lkml.kernel.org/r/20211001063444.102330-1-sohaib.amhmd@gmail.com
Signed-off-by: Sohaib Mohamed <sohaib.amhmd@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
[Dominique: Fix the fixed indentation...]
|
|
Warnings found by checkpatch.pl
Link: http://lkml.kernel.org/r/20210930235503.126033-1-sohaib.amhmd@gmail.com
Signed-off-by: Sohaib Mohamed <sohaib.amhmd@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Error found by checkpatch.pl
Link: http://lkml.kernel.org/r/20211001062454.99205-1-sohaib.amhmd@gmail.com
Signed-off-by: Sohaib Mohamed <sohaib.amhmd@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Convert the 9p filesystem to use the netfs helper lib to handle readpage,
readahead and write_begin, converting those into a common issue_op for the
filesystem itself to handle. The netfs helper lib also handles reading
from fscache if a cache is available, and interleaving reads from both
sources.
This change also switches from the old fscache I/O API to the new one,
meaning that fscache no longer keeps track of netfs pages and instead does
async DIO between the backing files and the 9p file pagecache. As a part
of this change, the handling of PG_fscache changes. It now just means that
the cache has a write I/O operation in progress on a page (PG_locked
is used for a read I/O op).
Note that this is a cut-down version of the fscache rewrite and does not
change any of the cookie and cache coherency handling.
Changes
=======
ver #4:
- Rebase on top of folios.
- Don't use wait_on_page_bit_killable().
ver #3:
- v9fs_req_issue_op() needs to terminate the subrequest.
- v9fs_write_end() needs to call SetPageUptodate() a bit more often.
- It's not CONFIG_{AFS,V9FS}_FSCACHE[1]
- v9fs_init_rreq() should take a ref on the p9_fid and the cleanup should
drop it [from Dominique Martinet].
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Dominique Martinet <asmadeus@codewreck.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/YUm+xucHxED+1MJp@codewreck.org/ [1]
Link: https://lore.kernel.org/r/163162772646.438332.16323773205855053535.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/163189109885.2509237.7153668924503399173.stgit@warthog.procyon.org.uk/ # rfc v2
Link: https://lore.kernel.org/r/163363943896.1980952.1226527304649419689.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163551662876.1877519.14706391695553204156.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163584179557.4023316.11089762304657644342.stgit@warthog.procyon.org.uk # rebase on folio
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Add the byte offset of the readahead request to the tracepoint output
so we know where the read starts.
Before this patch:
cat-8104 [002] ..... 813.168775: nfs_aop_readahead: fileid=00:31:141 fhandle=0xe55807f6 version=1756509392533525500 nr_pages=256
cat-8104 [002] ..... 813.174973: nfs_aop_readahead_done: fileid=00:31:141 fhandle=0xe55807f6 version=1756509392533525500 nr_pages=256 ret=0
cat-8104 [002] ..... 813.175963: nfs_aop_readahead: fileid=00:31:141 fhandle=0xe55807f6 version=1756509392533525500 nr_pages=256
cat-8104 [002] ..... 813.183742: nfs_aop_readahead_done: fileid=00:31:141 fhandle=0xe55807f6 version=1756509392533525500 nr_pages=1 ret=0
After this patch:
cat-6392 [001] ..... 73.107782: nfs_aop_readahead: fileid=00:31:141 fhandle=0xed22403f version=1756511950029502774 offset=5242880 nr_pages=256
cat-6392 [001] ..... 73.112466: nfs_aop_readahead_done: fileid=00:31:141 fhandle=0xed22403f version=1756511950029502774 nr_pages=256 ret=0
cat-6392 [001] ..... 73.115692: nfs_aop_readahead: fileid=00:31:141 fhandle=0xed22403f version=1756511950029502774 offset=6291456 nr_pages=256
cat-6392 [001] ..... 73.123283: nfs_aop_readahead_done: fileid=00:31:141 fhandle=0xed22403f version=1756511950029502774 nr_pages=256 ret=0
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
After the assignment, only exit path with label 'err' uses ret as
return value. However,before exiting through this path with label 'err',
ret is assigned with the return value of io_wq_max_workers(). Hence, the
initial assignment is redundant and can be removed.
Signed-off-by: Nghia Le <nghialm78@gmail.com>
Link: https://lore.kernel.org/r/20211102190521.28291-1-nghialm78@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull xfs updates from Darrick Wong:
"This cycle we've worked on fixing bugs and improving XFS' memory
footprint.
The most notable fixes include: fixing a corruption warning (and free
space accounting skew) if copy on write fails; fixing slab cache
misuse if SLOB is enabled, which apparently was broken for years
without anybody noticing; and fixing a potential race with online
shrinkfs.
Otherwise, the bulk of the changes here involve setting up separate
slab caches for frequently used items such as btree cursors and log
intent items, and compacting the structures to reduce memory usage of
those items substantially. This also sets us up to support larger
btrees in future kernels. We also switch parts of online fsck to
allocate scrub context information from the heap instead of using
stack space.
Summary:
- Bug fixes and cleanups for kernel memory allocation usage, this
time without touching the mm code.
- Refactor the log recovery mechanism that preserves held resources
across a transaction roll so that it uses the exact same mechanism
that we use for that during regular runtime.
- Fix bugs and tighten checking around btree heights.
- Remove more old typedefs.
- Fix perag reference leaks when racing with growfs.
- Remove unused fields from xfs_btree_cur.
- Allocate various scrub structures on the heap to reduce stack
usage.
- Pack xfs_btree_cur fields and rearrange to support arbitrary
heights.
- Compute maximum possible heights for each btree height, and use
that to set up slab caches for each btree type.
- Finally remove kmem_zone_t, since these have always been struct
kmem_cache on Linux.
- Compact the structures used to coordinate work intent items.
- Set up slab caches for each work intent item type.
- Rename the "bmap_add_free" function to "free_extent_later", which
more accurately describes what it does.
- Fix corruption warning on unmount when a CoW preallocation covers a
data fork delalloc reservation but then the CoW fails.
- Add some more minor code improvements"
* tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
xfs: use swap() to make code cleaner
xfs: Remove duplicated include in xfs_super
xfs: punch out data fork delalloc blocks on COW writeback failure
xfs: remove unused parameter from refcount code
xfs: reduce the size of struct xfs_extent_free_item
xfs: rename xfs_bmap_add_free to xfs_free_extent_later
xfs: create slab caches for frequently-used deferred items
xfs: compact deferred intent item structures
xfs: rename _zone variables to _cache
xfs: remove kmem_zone typedef
xfs: use separate btree cursor cache for each btree type
xfs: compute absolute maximum nlevels for each btree type
xfs: kill XFS_BTREE_MAXLEVELS
xfs: compute the maximum height of the rmap btree when reflink enabled
xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
xfs: compute maximum AG btree height for critical reservation calculation
xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
xfs: dynamically allocate cursors based on maxlevels
xfs: encode the max btree height in the cursor
xfs: refactor btree cursor allocation function
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
- Split the readpage handler for symlinks from the one for files. The
symlink readpage isn't given a file pointer, so the handling has to
be special-cased.
This has been posted as part of a patchset to foliate netfs, afs,
etc.[1] but I've moved it to this one as it's not actually doing
foliation but is more of a pre-cleanup.
- Fix file creation to set the mtime from the client's clock to keep
make happy if the server's clock isn't quite in sync.[2]
Link: https://lore.kernel.org/r/163005742570.2472992.7800423440314043178.stgit@warthog.procyon.org.uk/ [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html [2]
* tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Set mtime from the client for yfs create operations
afs: Sort out symlink reading
|
|
This patch fixes the following crash by receiving a invalid message:
[ 160.672220] ==================================================================
[ 160.676206] BUG: KASAN: user-memory-access in dlm_user_add_ast+0xc3/0x370
[ 160.679659] Read of size 8 at addr 00000000deadbeef by task kworker/u32:13/319
[ 160.681447]
[ 160.681824] CPU: 10 PID: 319 Comm: kworker/u32:13 Not tainted 5.14.0-rc2+ #399
[ 160.683472] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.14.0-1.module+el8.6.0+12648+6ede71a5 04/01/2014
[ 160.685574] Workqueue: dlm_recv process_recv_sockets
[ 160.686721] Call Trace:
[ 160.687310] dump_stack_lvl+0x56/0x6f
[ 160.688169] ? dlm_user_add_ast+0xc3/0x370
[ 160.689116] kasan_report.cold.14+0x116/0x11b
[ 160.690138] ? dlm_user_add_ast+0xc3/0x370
[ 160.690832] dlm_user_add_ast+0xc3/0x370
[ 160.691502] _receive_unlock_reply+0x103/0x170
[ 160.692241] _receive_message+0x11df/0x1ec0
[ 160.692926] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 160.693700] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 160.694427] ? lock_acquire+0x175/0x400
[ 160.695058] ? do_purge.isra.51+0x200/0x200
[ 160.695744] ? lock_acquired+0x360/0x5d0
[ 160.696400] ? lock_contended+0x6a0/0x6a0
[ 160.697055] ? lock_release+0x21d/0x5e0
[ 160.697686] ? lock_is_held_type+0xe0/0x110
[ 160.698352] ? lock_is_held_type+0xe0/0x110
[ 160.699026] ? ___might_sleep+0x1cc/0x1e0
[ 160.699698] ? dlm_wait_requestqueue+0x94/0x140
[ 160.700451] ? dlm_process_requestqueue+0x240/0x240
[ 160.701249] ? down_write_killable+0x2b0/0x2b0
[ 160.701988] ? do_raw_spin_unlock+0xa2/0x130
[ 160.702690] dlm_receive_buffer+0x1a5/0x210
[ 160.703385] dlm_process_incoming_buffer+0x726/0x9f0
[ 160.704210] receive_from_sock+0x1c0/0x3b0
[ 160.704886] ? dlm_tcp_shutdown+0x30/0x30
[ 160.705561] ? lock_acquire+0x175/0x400
[ 160.706197] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 160.706941] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 160.707681] process_recv_sockets+0x32/0x40
[ 160.708366] process_one_work+0x55e/0xad0
[ 160.709045] ? pwq_dec_nr_in_flight+0x110/0x110
[ 160.709820] worker_thread+0x65/0x5e0
[ 160.710423] ? process_one_work+0xad0/0xad0
[ 160.711087] kthread+0x1ed/0x220
[ 160.711628] ? set_kthread_struct+0x80/0x80
[ 160.712314] ret_from_fork+0x22/0x30
The issue is that we received a DLM message for a user lock but the
destination lock is a kernel lock. Note that the address which is trying
to derefence is 00000000deadbeef, which is in a kernel lock
lkb->lkb_astparam, this field should never be derefenced by the DLM
kernel stack. In case of a user lock lkb->lkb_astparam is lkb->lkb_ua
(memory is shared by a union field). The struct lkb_ua will be handled
by the DLM kernel stack but on a kernel lock it will contain invalid
data and ends in most likely crashing the kernel.
It can be reproduced with two cluster nodes.
node 2:
dlm_tool join test
echo "862 fooobaar 1 2 1" > /sys/kernel/debug/dlm/test_locks
echo "862 3 1" > /sys/kernel/debug/dlm/test_waiters
node 1:
dlm_tool join test
python:
foo = DLM(h_cmd=3, o_nextcmd=1, h_nodeid=1, h_lockspace=0x77222027, \
m_type=7, m_flags=0x1, m_remid=0x862, m_result=0xFFFEFFFE)
newFile = open("/sys/kernel/debug/dlm/comms/2/rawmsg", "wb")
newFile.write(bytes(foo))
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch adds functionality to put a lkb to the waiters state. It can
be useful to combine this feature with the "rawmsg" debugfs
functionality. It will bring the DLM lkb into a state that a message
will be parsed by the kernel.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch adds functionality to add an lkb during runtime. This is a
highly debugging feature only, wrong input can crash the kernel. It is a
early state feature as well. The goal is to provide a user interface for
manipulate dlm state and combine it with the rawmsg feature. It is
debugfs functionality, we don't care about UAPI breakage. Even it's
possible to add lkb's/rsb's which could never be exists in such wat by
using normal DLM operation. The user of this interface always need to
think before using this feature, not every crash which happens can really
occur during normal dlm operation.
Future there should be more functionality to add a more realistic lkb
which reflects normal DLM state inside the kernel. For now this is
enough.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch adds functionality to add a lkb with a specific id range.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch adds a dlm functionality to send a raw dlm message to a
specific cluster node. This raw message can be build by user space and
send out by writing the message to "rawmsg" dlm debugfs file.
There is a in progress scapy dlm module which provides a easy build of
DLM messages in user space. For example:
DLM(h_cmd=3, o_nextcmd=1, h_nodeid=1, h_lockspace=0xe4f48a18, ...)
The goal is to provide an easy reproducable state to crash DLM or to
fuzz the DLM kernel stack if there are possible ways to crash it.
Note: that if the sequence number is zero and dlm version is not set to
3.1 the kernel will automatic will set a right sequence number, otherwise
DLM stack testing is not possible.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch changes the dlm_lowcomms_new_msg() function pointer private data
from "struct mhandle *" to "void *" to provide different structures than
just "struct mhandle".
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch changes the ls_count busy wait to use atomic counter values
and wait_event() to wait until ls_count reach zero. It will slightly
reduce the number of holding lslist_lock. At remove lockspace we need to
retry the wait because it a lockspace get could interefere between
wait_event() and holding the lock which deletes the lockspace list entry.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch changes the requestqueue busy waiting algorithm to use
atomic counter values and wait_event() to wait until the requestqueue is
empty. It will slightly reduce the number of holding ls_requestqueue_mutex
mutex.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch adds tracepoints for dlm socket receive and send
functionality. We can use it to track how much data was send or received
to or from a specific nodeid.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch adds initial support for dlm tracepoints. It will introduce
tracepoints to dlm main functionality dlm_lock()/dlm_unlock() and their
complete ast() callback or blocking bast() callback.
The lock/unlock functionality has a start and end tracepoint, this is
because there exists a race in case if would have a tracepoint at the
end position only the complete/blocking callbacks could occur before. To
work with eBPF tracing and using their lookup hash functionality there
could be problems that an entry was not inserted yet. However use the
start functionality for hash insert and check again in end functionality
if there was an dlm internal error so there is no ast callback. In further
it might also that locks with local masters will occur those callbacks
immediately so we must have such functionality.
I did not make everything accessible yet, although it seems eBPF can be
used to access a lot of internal datastructures if it's aware of the
struct definitions of the running kernel instance. We still can change
it, if you do eBPF experiments e.g. time measurements between lock and
callback functionality you can simple use the local lkb_id field as hash
value in combination with the lockspace id if you have multiple
lockspaces. Otherwise you can simple use trace-cmd for some functionality,
e.g. `trace-cmd record -e dlm` and `trace-cmd report` afterwards.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch makes dlm_callback_resume info printout less noisy by
accumulate all callback queues into one printout not in 25 times steps.
It seems this printout became lately quite noisy in relationship with
gfs2.
Before:
[241767.849302] dlm: bin: dlm_callback_resume 25
[241767.854846] dlm: bin: dlm_callback_resume 25
[241767.860373] dlm: bin: dlm_callback_resume 25
...
[241767.865920] dlm: bin: dlm_callback_resume 25
[241767.871352] dlm: bin: dlm_callback_resume 25
[241767.876733] dlm: bin: dlm_callback_resume 25
After the patch:
[ 385.485728] dlm: gfs2: dlm_callback_resume 175
if zero it will not be printed out.
Reported-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch will change to evaluate the dlm_recovery_stopped() in the
condition of the if branch instead fetch it before evaluating the
condition. As this is an atomic test-set operation it should be
evaluated in the condition itself.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch will change to use dlm_recovery_stopped() which is the dlm way
to check if the LSFL_RECOVER_STOP flag in ls_flags by using the helper.
It is an atomic operation but the check is still as before to fetch the
value if ls_recover_lock is held. There might be more further
investigations if the value can be changed afterwards and if it has any
side effects.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch moves version conversion to little endian from a runtime
variable to compile time constant.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Since commit 764ff4011424 ("fs: dlm: auto load sctp module") we try
load the sctp module before we try to create a sctp kernel socket. That
a socket creation fails now has more likely other reasons. This patch
removes the part of error to load the sctp module and instead printout
the error code.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch improves the debug output for midcomms layer by also printing
out the nodeid where users counter belongs to.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch fixes a typo from lockspace to lockspace.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch removes an obsolete define for some length for an temporary
buffer which is not being used anymore. The use of this define is not
necessary anymore since commit 4798cbbfbd00 ("fs: dlm: rework receive
handling").
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix a locking order inversion between the inode and iopen glocks in
gfs2_inode_lookup.
- Implement proper queuing of glock holders for glocks that require
instantiation (like reading an inode or bitmap blocks from disk).
Before, multiple glock holders could race with each other and
half-initialized objects could be exposed; the GL_SKIP flag further
exacerbated this problem.
- Fix a rare deadlock between inode lookup / creation and remote delete
work.
- Fix a rare scheduling-while-atomic bug in dlm during glock hash table
walks.
- Various other minor fixes and cleanups.
* tag 'gfs2-v5.15-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (21 commits)
gfs2: Fix unused value warning in do_gfs2_set_flags()
gfs2: check context in gfs2_glock_put
gfs2: Fix glock_hash_walk bugs
gfs2: Cancel remote delete work asynchronously
gfs2: set glock object after nq
gfs2: remove RDF_UPTODATE flag
gfs2: Eliminate GIF_INVALID flag
gfs2: fix GL_SKIP node_scope problems
gfs2: split glock instantiation off from do_promote
gfs2: further simplify do_promote
gfs2: re-factor function do_promote
gfs2: Remove 'first' trace_gfs2_promote argument
gfs2: change go_lock to go_instantiate
gfs2: dump glocks from gfs2_consist_OBJ_i
gfs2: dequeue iopen holder in gfs2_inode_lookup error
gfs2: Save ip from gfs2_glock_nq_init
gfs2: Allow append and immutable bits to coexist
gfs2: Switch some BUG_ON to GLOCK_BUG_ON for debug
gfs2: move GL_SKIP check from glops to do_promote
gfs2: Add GL_SKIP holder flag to dump_holder
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
"Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock.
In the most basic deadlock scenario, that buffer will not be resident
and it will be mapped to the same file. Accessing the buffer will
trigger a page fault, and gfs2 will deadlock trying to take the same
inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled"
* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix mmap + page fault deadlocks for direct I/O
iov_iter: Introduce nofault flag to disable page faults
gup: Introduce FOLL_NOFAULT flag to disable page faults
iomap: Add done_before argument to iomap_dio_rw
iomap: Support partial direct I/O on user copy failures
iomap: Fix iomap_dio_rw return value for user copies
gfs2: Fix mmap + page fault deadlocks for buffered I/O
gfs2: Eliminate ip->i_gh
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Clean up function may_grant
gfs2: Add wrapper for iomap_file_buffered_write
iov_iter: Introduce fault_in_iov_iter_writeable
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
powerpc/kvm: Fix kvm_use_magic_page
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
|
|
In io-wq.c:io_wq_max_workers(), new_count[] was changed right after each
node's value was set. This caused the following node getting the setting
of the previous one.
Returned values are copied from node 0.
Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Signed-off-by: Beld Zhang <beldzhang@gmail.com>
[axboe: minor fixups]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|