Age | Commit message (Collapse) | Author |
|
This commit fix kernel-doc style comments with complete parameter
descriptions for the lookup_file(),lookup_dir_entry() and
lookup_file_dentry().
Link: https://lore.kernel.org/linux-trace-kernel/20240322062604.28862-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/unix/garbage.c
47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
4090fa373f0e ("af_unix: Replace garbage collection algorithm.")
Adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
958f56e48385 ("net/mlx5e: Un-expose functions in en.h")
49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
password rotation
There are various use cases that are becoming more common in which password
changes are scheduled on a server(s) periodically but the clients connected
to this server need to stay connected (even in the face of brief network
reconnects) due to mounts which can not be easily unmounted and mounted at
will, and servers that do password rotation do not always have the ability
to tell the clients exactly when to the new password will be effective,
so add support for an alt password ("password2=") on mount (and also
remount) so that we can anticipate the upcoming change to the server
without risking breaking existing mounts.
An alternative would have been to use the kernel keyring for this but the
processes doing the reconnect do not have access to the keyring but do
have access to the ses structure.
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In cifs_sfu_make_node(), on success, instantiate rather than leave it
with dentry unhashed negative to support callers that expect mknod(2)
to always instantiate.
This fixes the following test case:
mount.cifs //srv/share /mnt -o ...,sfu
mkfifo /mnt/fifo
./xfstests/ltp/growfiles -b -W test -e 1 -u -i 0 -L 30 /mnt/fifo
...
BUG: unable to handle page fault for address: 000000034cec4e58
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 1 PREEMPT SMP PTI
CPU: 0 PID: 138098 Comm: growfiles Kdump: loaded Not tainted
5.14.0-436.3987_1240945149.el9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:_raw_callee_save__kvm_vcpu_is_preempted+0x0/0x20
Code: e8 15 d9 61 00 e9 63 ff ff ff 41 bd ea ff ff ff e9 58 ff ff ff e8
d0 71 c0 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <48> 8b 04
fd 60 2b c1 99 80 b8 90 50 03 00 00 0f 95 c0 c3 cc cc cc
RSP: 0018:ffffb6a143cf7cf8 EFLAGS: 00010206
RAX: ffff8a9bc30fb038 RBX: ffff8a9bc666a200 RCX: ffff8a9cc0260000
RDX: 00000000736f622e RSI: ffff8a9bc30fb038 RDI: 000000007665645f
RBP: ffffb6a143cf7d70 R08: 0000000000001000 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a9bc666a200
R13: 0000559a302a12b0 R14: 0000000000001000 R15: 0000000000000000
FS: 00007fbed1dbb740(0000) GS:ffff8a9cf0000000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000034cec4e58 CR3: 0000000128ec6006 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? __mutex_lock.constprop.0+0x5f7/0x6a0
? __die_body.cold+0x8/0xd
? page_fault_oops+0x134/0x170
? exc_page_fault+0x62/0x150
? asm_exc_page_fault+0x22/0x30
? _pfx_raw_callee_save__kvm_vcpu_is_preempted+0x10/0x10
__mutex_lock.constprop.0+0x5f7/0x6a0
? __mod_memcg_lruvec_state+0x84/0xd0
pipe_write+0x47/0x650
? do_anonymous_page+0x258/0x410
? inode_security+0x22/0x60
? selinux_file_permission+0x108/0x150
vfs_write+0x2cb/0x410
ksys_write+0x5f/0xe0
do_syscall_64+0x5c/0xf0
? syscall_exit_to_user_mode+0x22/0x40
? do_syscall_64+0x6b/0xf0
? sched_clock_cpu+0x9/0xc0
? exc_page_fault+0x62/0x150
entry_SYSCALL_64_after_hwframe+0x6e/0x76
Cc: stable@vger.kernel.org
Fixes: 72bc63f5e23a ("smb3: fix creating FIFOs when mounting with "sfu" mount option")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We were decrementing the count of open files on server twice
for the case where we were closing cached directories.
Fixes: 8e843bf38f7b ("cifs: return a single-use cfid if we did not get a lease")
Cc: stable@vger.kernel.org
Acked-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The same list item will be used in both cap_delay_list and
cap_unlink_delay_list, so it's buggy to use two different locks
to protect them.
Cc: stable@vger.kernel.org
Fixes: dbc347ef7f0c ("ceph: add ceph_cap_unlink_work to fire check_caps() immediately")
Link: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AODC76VXRAMXKLFDCTK4TKFDDPWUSCN5
Reported-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Pull more bcachefs fixes from Kent Overstreet:
"Notable user impacting bugs
- On multi device filesystems, recovery was looping in
btree_trans_too_many_iters(). This checks if a transaction has
touched too many btree paths (because of iteration over many keys),
and isuses a restart to drop unneeded paths.
But it's now possible for some paths to exceed the previous limit
without iteration in the interior btree update path, since the
transaction commit will do alloc updates for every old and new
btree node, and during journal replay we don't use the btree write
buffer for locking reasons and thus those updates use btree paths
when they wouldn't normally.
- Fix a corner case in rebalance when moving extents on a
durability=0 device. This wouldn't be hit when a device was
formatted with durability=0 since in that case we'll only use it as
a write through cache (only cached extents will live on it), but
durability can now be changed on an existing device.
- bch2_get_acl() could rarely forget to handle a transaction restart;
this manifested as the occasional missing acl that came back after
dropping caches.
- Fix a major performance regression on high iops multithreaded write
workloads (only since 6.9-rc1); a previous fix for a deadlock in
the interior btree update path to check the journal watermark
introduced a dependency on the state of btree write buffer flushing
that we didn't want.
- Assorted other repair paths and recovery fixes"
* tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
bcachefs: Fix a race in btree_update_nodes_written()
bcachefs: btree_node_scan: Respect member.data_allowed
bcachefs: Don't scan for btree nodes when we can reconstruct
bcachefs: Fix check_topology() when using node scan
bcachefs: fix eytzinger0_find_gt()
bcachefs: fix bch2_get_acl() transaction restart handling
bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
bcachefs: Rename struct field swap to prevent macro naming collision
MAINTAINERS: Add entry for bcachefs documentation
Documentation: filesystems: Add bcachefs toctree
bcachefs: JOURNAL_SPACE_LOW
bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
bcachefs: fix rand_delete unit test
bcachefs: fix ! vs ~ typo in __clear_bit_le64()
bcachefs: Fix rebalance from durability=0 device
bcachefs: Print shutdown journal sequence number
...
|
|
The page has been marked clean before writepage is called. If we don't
redirty it before postponing the write, it might never get written.
Cc: stable@vger.kernel.org
Fixes: 503d4fa6ee28 ("ceph: remove reliance on bdi congestion")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Add "sched_prepare_exec" tracepoint, which is run right after the point
of no return but before the current task assumes its new exec identity.
Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
tracepoint runs before flushing the old exec, i.e. while the task still
has the original state (such as original MM), but when the new exec
either succeeds or crashes (but never returns to the original exec).
Being able to trace this event can be helpful in a number of use cases:
* allowing tracing eBPF programs access to the original MM on exec,
before current->mm is replaced;
* counting exec in the original task (via perf event);
* profiling flush time ("sched_prepare_exec" to "sched_process_exec").
Example of tracing output:
$ cat /sys/kernel/debug/tracing/trace_pipe
<...>-379 [003] ..... 179.626921: sched_prepare_exec: interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
<...>-381 [002] ..... 180.048580: sched_prepare_exec: interp=/bin/bash filename=/bin/bash pid=381 comm=sshd
<...>-385 [001] ..... 180.068277: sched_prepare_exec: interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
<...>-389 [006] ..... 192.020147: sched_prepare_exec: interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240411102158.1272267-1-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
The nfs4 mount fails with EIO on 64-bit big endian architectures since
v6.7. The issue arises from employing a union in the nfsd4_encode_fattr4()
function to overlay a 32-bit array with a 64-bit values based bitmap,
which does not function as intended. Address the endianness issue by
utilizing bitmap_from_arr32() to copy 32-bit attribute masks into a
bitmap in an endianness-agnostic manner.
Cc: stable@vger.kernel.org
Fixes: fce7913b13d0 ("NFSD: Use a bitmask loop to encode FATTR4 results")
Link: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2060217
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The sysfs_break_active_protection() routine has an obvious reference
leak in its error path. If the call to kernfs_find_and_get() fails then
kn will be NULL, so the companion sysfs_unbreak_active_protection()
routine won't get called (and would only cause an access violation by
trying to dereference kn->parent if it was called). As a result, the
reference to kobj acquired at the start of the function will never be
released.
Fix the leak by adding an explicit kobject_put() call when kn is NULL.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Fixes: 2afc9166f79b ("scsi: sysfs: Introduce sysfs_{un,}break_active_protection()")
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/8a4d3f0f-c5e3-4b70-a188-0ca433f9e6f9@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The variable retval is being assigned a value that is not being read,
it is being re-assigned later on in the function. The assignment
is redundant and can be removed.
Cleans up clang scan build warning:
fs/direct-io.c:1220:2: warning: Value stored to 'retval' is never
read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240410162221.292485-1-colin.i.king@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull bootconfig fixes from Masami Hiramatsu:
- show the original cmdline only once, and only if it was modeified by
bootconfig
* tag 'bootconfig-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fs/proc: Skip bootloader comment if no embedded kernel parameters
fs/proc: remove redundant comments from /proc/bootconfig
|
|
We weren't respecting trans->journal_replay_not_finished - we shouldn't
be searching the journal keys unless we have a ref on them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
dropping read locks in bch2_btree_node_lock_write_nofail() dates from
before we had the cycle detector; we can now tell the cycle detector
directly when taking a lock may not fail because we can't handle
transaction restarts.
This is needed for adding should_be_locked asserts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
One btree update might have terminated in a node update, and then while
it is in flight another btree update might free that original node.
This race has to be handled in btree_update_nodes_written() - we were
missing a READ_ONCE().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
cifs_get_fattr() may be called with a NULL inode, so check for a
non-NULL inode before calling
cifs_mark_open_handles_for_deleted_file().
This fixes the following oops:
mount.cifs //srv/share /mnt -o ...,vers=3.1.1
cd /mnt
touch foo; tail -f foo &
rm foo
cat foo
BUG: kernel NULL pointer dereference, address: 00000000000005c0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 696 Comm: cat Not tainted 6.9.0-rc2 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-1.fc39 04/01/2014
RIP: 0010:__lock_acquire+0x5d/0x1c70
Code: 00 00 44 8b a4 24 a0 00 00 00 45 85 f6 0f 84 bb 06 00 00 8b 2d
48 e2 95 01 45 89 c3 41 89 d2 45 89 c8 85 ed 0 0 <48> 81 3f 40 7a 76
83 44 0f 44 d8 83 fe 01 0f 86 1b 03 00 00 31 d2
RSP: 0018:ffffc90000b37490 EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff888110021ec0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000005c0
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000200
FS: 00007f2a1fa08740(0000) GS:ffff888157a00000(0000)
knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2: 00000000000005c0 CR3: 000000011ac7c000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x23/0x70
? page_fault_oops+0x180/0x490
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x70/0x230
? asm_exc_page_fault+0x26/0x30
? __lock_acquire+0x5d/0x1c70
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
lock_acquire+0xc0/0x2d0
? cifs_mark_open_handles_for_deleted_file+0x3a/0x100 [cifs]
? srso_alias_return_thunk+0x5/0xfbef5
? kmem_cache_alloc+0x2d9/0x370
_raw_spin_lock+0x34/0x80
? cifs_mark_open_handles_for_deleted_file+0x3a/0x100 [cifs]
cifs_mark_open_handles_for_deleted_file+0x3a/0x100 [cifs]
cifs_get_fattr+0x24c/0x940 [cifs]
? srso_alias_return_thunk+0x5/0xfbef5
cifs_get_inode_info+0x96/0x120 [cifs]
cifs_lookup+0x16e/0x800 [cifs]
cifs_atomic_open+0xc7/0x5d0 [cifs]
? lookup_open.isra.0+0x3ce/0x5f0
? __pfx_cifs_atomic_open+0x10/0x10 [cifs]
lookup_open.isra.0+0x3ce/0x5f0
path_openat+0x42b/0xc30
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
do_filp_open+0xc4/0x170
do_sys_openat2+0xab/0xe0
__x64_sys_openat+0x57/0xa0
do_syscall_64+0xc1/0x1e0
entry_SYSCALL_64_after_hwframe+0x72/0x7a
Fixes: ffceb7640cbf ("smb: client: do not defer close open handles to deleted files")
Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Rather than use the older style ->read() hook, use ->read_iter() so that
signalfd can support both O_NONBLOCK and IOCB_NOWAIT for non-blocking
read attempts.
Split the fd setup into two parts, so that signalfd can mark the file
mode with FMODE_NOWAIT before installing it into the process table.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Rather than use the older style ->read() hook, use ->read_iter() so that
userfaultfd can support both O_NONBLOCK and IOCB_NOWAIT for non-blocking
read attempts.
Split the fd setup into two parts, so that userfaultfd can mark the file
mode with FMODE_NOWAIT before installing it into the process table. With
that, we can also defer grabbing the mm until we know the rest will
succeed, as the fd isn't visible before then.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch timerfd to using fops->read_iter(), so it can support not just
O_NONBLOCK but IOCB_NOWAIT as well. With the latter, users like io_uring
interact with timerfds a lot better, as they can be driven purely
by the poll trigger.
Manually get and install the required fd, so that FMODE_NOWAIT can be
set before the file is installed into the file table.
No functional changes intended in this patch, it's purely a straight
conversion to using the read iterator method.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In 9p2000 legacy mode, stat2inode initializes nlink to 1,
which is redundant with what alloc_inode should have already set.
9p2000.u overrides this with extensions if present in the stat
structure, and 9p2000.L incorporates nlink into its stat structure.
At the very least this probably messes with directory nlink
accounting in legacy mode.
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
If a device wasn't used for btree nodes, no need to scan for them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fixes the following Coccinelle/coccicheck warning reported by
string_choices.cocci:
opportunity for str_plural(zgroup->g_nr_zones)
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
|
|
[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.
[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed (fell back to single page
allocation) we still retry but with extra memalloc_retry_wait().
If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.
The behavior was introduced in commit 395cb57e8560 ("btrfs: wait between
incomplete batch memory allocations").
[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case at least nowadays.
All the mainlined filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.
For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.
So I don't believe we should call memalloc_retry_wait() unconditionally
for short allocation.
Call memalloc_retry_wait() if it fails to allocate any page for tree
block allocation (which goes with __GFP_NOFAIL and may not need the
special handling anyway), and reduce the latency for
btrfs_alloc_page_array().
Reported-by: Julian Taylor <julian.taylor@1und1.de>
Tested-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Fixes: 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add an ASSERT to catch a faulty delayed reference item resulting from
prematurely cleared extent buffer.
Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which
is suspicious as its update will be lost.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.
Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:
1) it can lead to adding faulty delayed reference item which leads to a
FS corrupted (EUCLEAN) error, and
2) it writes out cleared tree node on disk
The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.
[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/
The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.
Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.
Fixes: aa6313e6ff2b ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move dlm message processing from an ordered workqueue context to an
ordered softirq context. Handling dlm messages in softirq will allow
requests to be cleared more quickly and efficiently, and should avoid
longer queues of incomplete requests. Later patches are expected to
run completion/blocking callbacks directly from this message processing
context, further reducing context switches required to complete a request.
In the longer term, concurrent message processing could be implemented.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Use spin_lock_bh for all spinlocks involved in message processing,
in preparation for softirq message processing. DLM lock requests
from user space involve dlm processing in user context, in addition
to the standard kernel context, necessitating bh variants.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Remove an explicit schedule() call in the message processing path,
in preparation for softirq message processing.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Convert ls_recv_active rw_semaphore to an rwlock to avoid
sleeping, in preparation for softirq message processing.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
The end of the recovery process transitioned to normal message
processing by temporarily blocking the receiving context,
processing saved messages, then unblocking the receiving
context. To avoid blocking the receiving context, the old
wait_queue and mutex are replaced by a new rwlock and the new
RECV_MSG_BLOCKED flag. Received messages are added to the
list of saved messages, protected by the rwlock, until the
flag is cleared, which happens when all saved messages have
been processed.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Convert the rsb struct res_lock from a mutex to a spinlock
in preparation for processing messages in softirq context.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Convert the waiters mutex to a spinlock in prepration for
processing messages in softirq context.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
The waiters_mutex no longer needs to be used in the waiters recovery
functions dlm_recover_waiters_pre() and dlm_recover_waiters_pre().
During recovery, ordinary locking operations are paused, and the
recovery thread is the only context accessing the waiters list,
so the lock is not needed.
Access to the waiters list from debugfs functions is avoided by
taking the top level recovery lock in the debugfs dump function.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Add a new struct to save the current position in the rsb masters_list
while sending the rsb names to other nodes. The rsb names are sent in
multiple chunks, and for each new chunk, the new "dlm_dir_dump" struct
saves the last position in the masters_list. The new struct is also
used to save more information to sanity check the recovery process.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Move the rsb root_list from the lockspace to a stack variable since
it is now only used by the ls_recover() function.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Add a new "masters_list" for master rsb structs, with a new
rwlock. The new list is created and used during the recovery
process to send the master rsb names to new nodes. With this
change, the current "root_list" can be used without locking.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Move dlm_create_root_list() and dlm_release_root_list() to
recover.c and declare them static because they are only used
there.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Replace GFP_NOFS with GFP_ATOMIC. Also stop using idr_preload which
uses a non-bh spin_lock. This is further preparation for softirq
message processing.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Commit 228804a35caa tried to add a refcount check to
gfs2_scan_glock_lru() to make sure that glocks that are still referenced
cannot be freed. It failed to account for the bias state_change() adds
to the refcount for held glocks, so held glocks are no longer removed
from the glock cache, which can lead to out-of-memory problems. Fix
that. (The inodes those glocks are associated with do get shrunk and do
get pushed out of memory.)
In addition, use the same eligibility check in gfs2_scan_glock_lru() and
gfs2_dispose_glock_lru().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
This reverts commit 62862485a4c3a52029fc30f4bdde9af04afdafc9.
Commit 62862485a4c3 tried to fix issues introduced by commit
228804a35caa ("gfs2: Make glock lru list scanning safer"), but like that
commit, it failed to account for the bias state_change() adds to the
glock reference count for locked glocks. Revert commit 62862485a4c3 so
that we can fix commit 228804a35caa properly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Commit 3e11e53041502 tries to suppress dlm_lock() lock conversion errors
that occur when the lockspace has already been released.
It does that by setting and checking the SDF_SKIP_DLM_UNLOCK flag. This
conflicts with the intended meaning of the SDF_SKIP_DLM_UNLOCK flag, so
check whether the lockspace is still allocated instead.
(Given the current DLM API, checking for this kind of error after the
fact seems easier that than to make sure that the lockspace is still
allocated before calling dlm_lock(). Changing the DLM API so that users
maintain the lockspace references themselves would be an option.)
Fixes: 3e11e53041502 ("GFS2: ignore unlock failures after withdraw")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The GLF_LOCK flag is protected by the gl->gl_lockref.lock spin lock
which is held when entering run_queue(), so we can use test_bit() and
set_bit() here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_dispose_glock_lru(), we want to skip glocks which are in the
process of transitioning state (as indicated by the set GLF_LOCK flag),
but we we don't need to set that flag for requesting a state transition.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Function gfs2_glock_queue_put() puts a glock reference by enqueuing
glock work instead of putting the reference directly. This ensures that
the operation won't sleep, but it is costly and really only necessary
when putting the final glock reference. Replace it with a new
gfs2_glock_put_async() function that only queues glock work when putting
the last glock reference.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In function signal_our_withdraw(), we are calling gfs2_glock_queue_put()
in a context in which we are actually allowed to sleep, so replace that
with a simple call to gfs2_glock_put().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_jindex_free(), set sdp->sd_jdesc to NULL under the log flush
lock to provide exclusion against gfs2_log_flush().
In gfs2_log_flush(), check if sdp->sd_jdesc is non-NULL before
dereferencing it. Otherwise, we could run into a NULL pointer
dereference when outstanding glock work races with an unmount
(glock_work_func -> run_queue -> do_xmote -> inode_go_sync ->
gfs2_log_flush).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Commit fffe9bee14b0 ("gfs2: Delay withdraw from atomic context")
switched from gfs2_withdraw() to gfs2_withdraw_delayed() in
gfs2_ail_error(), but failed to then check if a delayed withdraw had
occurred. Fix that by adding the missing check in __gfs2_ail_flush(),
where the spin locks are already dropped and a withdraw is possible.
Fixes: fffe9bee14b0 ("gfs2: Delay withdraw from atomic context")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When calling gfs2_glock_get(), consistently pass the CREATE or NO_CREATE
flag for the create argument. This is the only place that passes 0
instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Follow up to commit e7beb8b6de1a ("gfs2: Rename SDF_DEACTIVATING to
SDF_KILL") and change the description of the SDF_KILL flag from
"Deactivating" to "Killing".
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|