Age | Commit message (Collapse) | Author |
|
New datapaths may use multiple descriptor units to describe
a single packet. Prepare for that by adding a descriptors
per simple frame constant into ring size calculations.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To reduce the coupling of slow path ring implementations and their
callers, use callbacks instead.
Changes to Jakub's work:
* Also use callbacks for xmit functions
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In preparation for support for a new datapath format move all
ring and fast path logic into separate files. It is basically
a verbatim move with some wrapping functions, no new structures
and functions added.
The current data path is called NFD3 from the initial version
of the driver ABI it used. The non-fast path, but ring related
functions are moved to nfp_net_dp.c file.
Changes to Jakub's work:
* Rebase on xsk related code.
* Split the patch, move the callback changes to next commit.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Ring enable masks are 64bit long. Replace mask calculation from:
block_cnt == 64 ? 0xffffffffffffffffULL : (1 << block_cnt) - 1
with:
(U64_MAX >> (64 - block_cnt))
to simplify the code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
free_watch() does everything barring actually freeing the watch object. Fix
this by adding the missing kfree.
kmemleak produces a report something like the following. Note that as an
address can be seen in the first word, the watch would appear to have gone
through call_rcu().
BUG: memory leak
unreferenced object 0xffff88810ce4a200 (size 96):
comm "syz-executor352", pid 3605, jiffies 4294947473 (age 13.720s)
hex dump (first 32 bytes):
e0 82 48 0d 81 88 ff ff 00 00 00 00 00 00 00 00 ..H.............
80 a2 e4 0c 81 88 ff ff 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8214e6cc>] kmalloc include/linux/slab.h:581 [inline]
[<ffffffff8214e6cc>] kzalloc include/linux/slab.h:714 [inline]
[<ffffffff8214e6cc>] keyctl_watch_key+0xec/0x2e0 security/keys/keyctl.c:1800
[<ffffffff8214ec84>] __do_sys_keyctl+0x3c4/0x490 security/keys/keyctl.c:2016
[<ffffffff84493a25>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff84493a25>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
[<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-and-tested-by: syzbot+6e2de48f06cdb2884bfc@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
In watch_queue_set_size(), the error cleanup code doesn't take account of
the fact that __free_page() can't handle a NULL pointer when trying to free
up buffer pages that did get allocated.
Fix this by only calling __free_page() on the pages actually allocated.
Without the fix, this can lead to something like the following:
BUG: KASAN: null-ptr-deref in __free_pages+0x1f/0x1b0 mm/page_alloc.c:5473
Read of size 4 at addr 0000000000000034 by task syz-executor168/3599
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
__kasan_report mm/kasan/report.c:446 [inline]
kasan_report.cold+0x66/0xdf mm/kasan/report.c:459
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
instrument_atomic_read include/linux/instrumented.h:71 [inline]
atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
page_ref_count include/linux/page_ref.h:67 [inline]
put_page_testzero include/linux/mm.h:717 [inline]
__free_pages+0x1f/0x1b0 mm/page_alloc.c:5473
watch_queue_set_size+0x499/0x630 kernel/watch_queue.c:275
pipe_ioctl+0xac/0x2b0 fs/pipe.c:632
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-and-tested-by: syzbot+d55757faa9b80590767b@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next.
This patchset contains updates for the nf_tables register tracking
infrastructure, disable bogus warning when attaching ct helpers,
one namespace pollution fix and few cleanups for the flowtable.
1) Revisit conntrack gc routine to reduce chances of overruning
the netlink buffer from the event path. From Florian Westphal.
2) Disable warning on explicit ct helper assignment, from Phil Sutter.
3) Read-only expressions do not update registers, mark them as
NFT_REDUCE_READONLY. Add helper functions to update the register
tracking information. This patch re-enables the register tracking
infrastructure.
4) Cancel register tracking in case an expression fully/partially
clobbers existing data.
5) Add register tracking support for remaining expressions: ct,
lookup, meta, numgen, osf, hash, immediate, socket, xfrm, tunnel,
fib, exthdr.
6) Rename init and exit functions for the conntrack h323 helper,
from Randy Dunlap.
7) Remove redundant field in struct flow_offload_work.
8) Update nf_flow_table_iterate() to pass flowtable to callback.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Reset the last_readdir at the same time, and add a comment explaining
why we don't free last_readdir when dir_emit returns false.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
If read_mapping_folio() fails then "inline_version" is printed without
being initialized.
[ jlayton: use CEPH_INLINE_NONE instead of "-1" ]
Fixes: 083db6fd3e73 ("ceph: uninline the data on a file opened for writing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
stdev is computed in `cephfs-top` tool - clients forward
square of sums and IO count required to calculate stdev.
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Make the math a bit simpler to understand (should not
affect execution speeds).
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Latencies are of type ktime_t, coverting from jiffies is incorrect.
Also, switch to "struct ceph_timespec" for r/w/m latencies.
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The ceph_find_inode() may will fail and return NULL.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The ceph_get_inode() will search for or insert a new inode into the
hash for the given vino, and return a reference to it. If new is
non-NULL, its reference is consumed.
We should release the reference when in error handing cases.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Cache move-in for virtual accesses is controlled by the TLB. Thus,
we must generally purge TLB entries before flushing. The flush routines
must use TLB entries that inhibit cache move-in.
V2: Load physical address prior to flushing TLB. In flush_cache_page,
flush TLB when flushing and purging.
V3: Don't flush when start equals end.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Standalone ports use vid 0. Let the bridge use vid 1 when
"vlan_default_pvid 0" is set to avoid collisions. Since no
VLAN is created when default pvid is 0 this is set
at "PORT_ATTR_SET" and handled in the Switchdev fdb handler.
Signed-off-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
allocated_mem is allocated by kcalloc(). The memory is set to zero.
It is unnecessary to call memset again.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In calipso_map_cat_ntoh(), in the for loop, if the return value of
netlbl_bitmap_walk() is equal to (net_clen_bits - 1), when
netlbl_bitmap_walk() is called next time, out-of-bounds memory accesses
of bitmap[byte_offset] occurs.
The bug was found during fuzzing. The following is the fuzzing report
BUG: KASAN: slab-out-of-bounds in netlbl_bitmap_walk+0x3c/0xd0
Read of size 1 at addr ffffff8107bf6f70 by task err_OH/252
CPU: 7 PID: 252 Comm: err_OH Not tainted 5.17.0-rc7+ #17
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x21c/0x230
show_stack+0x1c/0x60
dump_stack_lvl+0x64/0x7c
print_address_description.constprop.0+0x70/0x2d0
__kasan_report+0x158/0x16c
kasan_report+0x74/0x120
__asan_load1+0x80/0xa0
netlbl_bitmap_walk+0x3c/0xd0
calipso_opt_getattr+0x1a8/0x230
calipso_sock_getattr+0x218/0x340
calipso_sock_getattr+0x44/0x60
netlbl_sock_getattr+0x44/0x80
selinux_netlbl_socket_setsockopt+0x138/0x170
selinux_socket_setsockopt+0x4c/0x60
security_socket_setsockopt+0x4c/0x90
__sys_setsockopt+0xbc/0x2b0
__arm64_sys_setsockopt+0x6c/0x84
invoke_syscall+0x64/0x190
el0_svc_common.constprop.0+0x88/0x200
do_el0_svc+0x88/0xa0
el0_svc+0x128/0x1b0
el0t_64_sync_handler+0x9c/0x120
el0t_64_sync+0x16c/0x170
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Duoming Zhou says:
====================
Fix refcount leak and NPD bugs in ax25
The first patch fixes refcount leak in ax25 that could cause
ax25-ex-connected-session-now-listening-state-bug.
The second patch fixes NPD bugs in ax25 timers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The previous commit 7ec02f5ac8a5 ("ax25: fix NPD bug in ax25_disconnect")
move ax25_disconnect into lock_sock() in order to prevent NPD bugs. But
there are race conditions that may lead to null pointer dereferences in
ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(),
ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we use
ax25_kill_by_device() to detach the ax25 device.
One of the race conditions that cause null pointer dereferences can be
shown as below:
(Thread 1) | (Thread 2)
ax25_connect() |
ax25_std_establish_data_link() |
ax25_start_t1timer() |
mod_timer(&ax25->t1timer,..) |
| ax25_kill_by_device()
(wait a time) | ...
| s->ax25_dev = NULL; //(1)
ax25_t1timer_expiry() |
ax25->ax25_dev->values[..] //(2)| ...
... |
We set null to ax25_cb->ax25_dev in position (1) and dereference
the null pointer in position (2).
The corresponding fail log is shown below:
===============================================================
BUG: kernel NULL pointer dereference, address: 0000000000000050
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.17.0-rc6-00794-g45690b7d0
RIP: 0010:ax25_t1timer_expiry+0x12/0x40
...
Call Trace:
call_timer_fn+0x21/0x120
__run_timers.part.0+0x1ca/0x250
run_timer_softirq+0x2c/0x60
__do_softirq+0xef/0x2f3
irq_exit_rcu+0xb6/0x100
sysvec_apic_timer_interrupt+0xa2/0xd0
...
This patch moves ax25_disconnect() before s->ax25_dev = NULL
and uses del_timer_sync() to delete timers in ax25_disconnect().
If ax25_disconnect() is called by ax25_kill_by_device() or
ax25->ax25_dev is NULL, the reason in ax25_disconnect() will be
equal to ENETUNREACH, it will wait all timers to stop before we
set null to s->ax25_dev in ax25_kill_by_device().
Fixes: 7ec02f5ac8a5 ("ax25: fix NPD bug in ax25_disconnect")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The previous commit d01ffb9eee4a ("ax25: add refcount in ax25_dev to
avoid UAF bugs") and commit feef318c855a ("ax25: fix UAF bugs of
net_device caused by rebinding operation") increase the refcounts of
ax25_dev and net_device in ax25_bind() and decrease the matching refcounts
in ax25_kill_by_device() in order to prevent UAF bugs, but there are
reference count leaks.
The root cause of refcount leaks is shown below:
(Thread 1) | (Thread 2)
ax25_bind() |
... |
ax25_addr_ax25dev() |
ax25_dev_hold() //(1) |
... |
dev_hold_track() //(2) |
... | ax25_destroy_socket()
| ax25_cb_del()
| ...
| hlist_del_init() //(3)
|
|
(Thread 3) |
ax25_kill_by_device() |
... |
ax25_for_each(s, &ax25_list) { |
if (s->ax25_dev == ax25_dev) //(4) |
... |
Firstly, we use ax25_bind() to increase the refcount of ax25_dev in
position (1) and increase the refcount of net_device in position (2).
Then, we use ax25_cb_del() invoked by ax25_destroy_socket() to delete
ax25_cb in hlist in position (3) before calling ax25_kill_by_device().
Finally, the decrements of refcounts in ax25_kill_by_device() will not
be executed, because no s->ax25_dev equals to ax25_dev in position (4).
This patch adds decrements of refcounts in ax25_release() and use
lock_sock() to do synchronization. If refcounts decrease in ax25_release(),
the decrements of refcounts in ax25_kill_by_device() will not be
executed and vice versa.
Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs")
Fixes: 87563a043cef ("ax25: fix reference count leaks of ax25_dev")
Fixes: feef318c855a ("ax25: fix UAF bugs of net_device caused by rebinding operation")
Reported-by: Thomas Osterried <thomas@osterried.de>
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the <linux/cgroup-defs.h> dependency to <linux/psi.h>, because
cgroup_move_task() will dereference 'struct css_set'.
( Only older toolchains are affected, due to variations in
the implementation of rcu_assign_pointer() et al. )
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Borislav Petkov <bp@suse.de>
|
|
This reverts commit cf3e26427c08ad9015956293ab389004ac6a338e.
Multi-vCPU Hyper-V guests started crashing randomly on boot with the
latest kvm/queue and the problem can be bisected the problem to this
particular patch. Basically, I'm not able to boot e.g. 16-vCPU guest
successfully anymore. Both Intel and AMD seem to be affected. Reverting
the commit saves the day.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Since "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"
is going to be reverted, it's not going to be true anymore that
the zap-page flow does not free any 'struct kvm_mmu_page'. Introduce
an early flush before tdp_mmu_zap_leafs() returns, to preserve
bisectability.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.18-2022-03-18:
amdgpu:
- Aldebaran fixes
- SMU 13.0.5 fixes
- DCN 3.1.5 fixes
- DCN 3.1.6 fixes
- Pipe split fixes
- More display FP cleanup
- DP 2.0 UHBR fix
- DC GPU reset fix
- DC deep color ratio fix
- SMU robustness fixes
- Runtime PM fix for APUs
- IGT reload fixes
- SR-IOV fix
- Misc fixes and cleanups
amdkfd:
- CRIU fixes
- SVM fixes
UAPI:
- Properly handle SDMA transfers with CRIU
Proposed user mode change: https://github.com/checkpoint-restore/criu/pull/1709
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220318203717.5833-1-alexander.deucher@amd.com
|
|
When CONFIG_DEBUG_INFO_BTF is disabled, bpf_get_btf_vmlinux can return a
NULL pointer. Check for it in btf_get_module_btf to prevent a NULL pointer
dereference.
While kernel test robot only complained about this specific case, let's
also check for NULL in other call sites of bpf_get_btf_vmlinux.
Fixes: 9492450fd287 ("bpf: Always raise reference in btf_get_module_btf")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220320143003.589540-1-memxor@gmail.com
|
|
In remove_phb_dynamic() we use &phb->io_resource, after we've called
device_unregister(&host_bridge->dev). But the unregister may have freed
phb, because pcibios_free_controller_deferred() is the release function
for the host_bridge.
If there are no outstanding references when we call device_unregister()
then phb will be freed out from under us.
This has gone mainly unnoticed, but with slub_debug and page_poison
enabled it can lead to a crash:
PID: 7574 TASK: c0000000d492cb80 CPU: 13 COMMAND: "drmgr"
#0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc
#1 [c0000000e4f075d0] oops_end at c000000000029608
#2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4
#3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8
#4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30
Data SLB Access [380] exception frame:
R0: c000000000167250 R1: c0000000e4f07a00 R2: c000000002a46100
R3: c000000002b39ce8 R4: 00000000000000c0 R5: 00000000000000a9
R6: 3894674d000000c0 R7: 0000000000000000 R8: 00000000000000ff
R9: 0000000000000100 R10: 6b6b6b6b6b6b6b6b R11: 0000000000008000
R12: c00000000023da80 R13: c0000009ffd38b00 R14: 0000000000000000
R15: 000000011c87f0f0 R16: 0000000000000006 R17: 0000000000000003
R18: 0000000000000002 R19: 0000000000000004 R20: 0000000000000005
R21: 000000011c87ede8 R22: 000000011c87c5a8 R23: 000000011c87d3a0
R24: 0000000000000000 R25: 0000000000000001 R26: c0000000e4f07cc8
R27: c00000004d1cc400 R28: c0080000031d00e8 R29: c00000004d23d800
R30: c00000004d1d2400 R31: c00000004d1d2540
NIP: c000000000167258 MSR: 8000000000009033 OR3: c000000000e9f474
CTR: 0000000000000000 LR: c000000000167250 XER: 0000000020040003
CCR: 0000000024088420 MQ: 0000000000000000 DAR: 6b6b6b6b6b6b6ba3
DSISR: c0000000e4f07920 Syscall Result: fffffffffffffff2
[NIP : release_resource+56]
[LR : release_resource+48]
#5 [c0000000e4f07a00] release_resource at c000000000167258 (unreliable)
#6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648
#7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io]
#8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io]
#9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c
#10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504
#11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868
#12 [c0000000e4f07c70] new_sync_write at c00000000054339c
#13 [c0000000e4f07d10] vfs_write at c000000000546624
#14 [c0000000e4f07d60] ksys_write at c0000000005469f4
#15 [c0000000e4f07db0] system_call_exception at c000000000030840
#16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168
To avoid it, we can take a reference to the host_bridge->dev until we're
done using phb. Then when we drop the reference the phb will be freed.
Fixes: 2dd9c11b9d4d ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)")
Reported-by: David Dai <zdai@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lore.kernel.org/r/20220318034219.1188008-1-mpe@ellerman.id.au
|
|
Add a test case for stacktrace with skip > 0 using a small sized
buffer. It didn't support skipping entries greater than or equal to
the size of buffer and filled the skipped part with 0.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220314182042.71025-2-namhyung@kernel.org
|
|
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only 'num_elem - skip' frames.
This is because it sets init_nr in the perf_callchain_entry to the end
of the buffer to save num_elem entries only. I believe it was because
the perf callchain code unwound the stack frames until it reached the
global max size (sysctl_perf_event_max_stack).
However it now has perf_callchain_entry_ctx.max_stack to limit the
iteration locally. This simplifies the code to handle init_nr in the
BPF callstack entries and removes the confusion with the perf_event's
__PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0.
Also change the comment on bpf_get_stack() in the header file to be
more explicit what the return value means.
Fixes: c195651e565a ("bpf: add bpf_get_stack helper")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com
Link: https://lore.kernel.org/bpf/20220314182042.71025-1-namhyung@kernel.org
Based-on-patch-by: Eugene Loh <eugene.loh@oracle.com>
|
|
Using HPAGE_PMD_SIZE as the size for bpf_prog_pack is not ideal in some
cases. Specifically, for NUMA systems, __vmalloc_node_range requires
PMD_SIZE * num_online_nodes() to allocate huge pages. Also, if the system
does not support huge pages (i.e., with cmdline option nohugevmalloc), it
is better to use PAGE_SIZE packs.
Add logic to select proper size for bpf_prog_pack. This solution is not
ideal, as it makes assumption about the behavior of module_alloc and
__vmalloc_node_range. However, it appears to be the easiest solution as
it doesn't require changes in module_alloc and vmalloc code.
Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220311201135.3573610-1-song@kernel.org
|
|
Jakub Sitnicki says:
====================
This patch set is a result of a discussion we had around the RFC patchset from
Ilya [1]. The fix for the narrow loads from the RFC series is still relevant,
but this series does not depend on it. Nor is it required to unbreak sk_lookup
tests on BE, if this series gets applied.
To summarize the takeaways from [1]:
1) we want to make 2-byte load from ctx->remote_port portable across LE and BE,
2) we keep the 4-byte load from ctx->remote_port as it is today - result varies
on endianess of the platform.
[1] https://lore.kernel.org/bpf/20220222182559.2865596-2-iii@linux.ibm.com/
v1 -> v2:
- Remove needless check that 4-byte load is from &ctx->remote_port offset
(Martin)
[v1]: https://lore.kernel.org/bpf/20220317165826.1099418-1-jakub@cloudflare.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The context access converter rewrites the 4-byte load from
bpf_sk_lookup->remote_port to a 2-byte load from bpf_sk_lookup_kern
structure.
It means that we cannot treat the destination register contents as a 32-bit
value, or the code will not be portable across big- and little-endian
architectures.
This is exactly the same case as with 4-byte loads from bpf_sock->dst_port
so follow the approach outlined in [1] and treat the register contents as a
16-bit value in the test.
[1]: https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com/
Fixes: 2ed0dc5937d3 ("selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-4-jakub@cloudflare.com
|
|
In commit 9a69e2b385f4 ("bpf: Make remote_port field in struct
bpf_sk_lookup 16-bit wide") ->remote_port field changed from __u32 to
__be16.
However, narrow load tests which exercise 1-byte sized loads from
offsetof(struct bpf_sk_lookup, remote_port) were not adopted to reflect the
change.
As a result, on little-endian we continue testing loads from addresses:
- (__u8 *)&ctx->remote_port + 3
- (__u8 *)&ctx->remote_port + 4
which map to the zero padding following the remote_port field, and don't
break the tests because there is no observable change.
While on big-endian, we observe breakage because tests expect to see zeros
for values loaded from:
- (__u8 *)&ctx->remote_port - 1
- (__u8 *)&ctx->remote_port - 2
Above addresses map to ->remote_ip6 field, which precedes ->remote_port,
and are populated during the bpf_sk_lookup IPv6 tests.
Unsurprisingly, on s390x we observe:
#136/38 sk_lookup/narrow access to ctx v4:OK
#136/39 sk_lookup/narrow access to ctx v6:FAIL
Fix it by removing the checks for 1-byte loads from offsets outside of the
->remote_port field.
Fixes: 9a69e2b385f4 ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide")
Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-3-jakub@cloudflare.com
|
|
In commit 9a69e2b385f4 ("bpf: Make remote_port field in struct
bpf_sk_lookup 16-bit wide") the remote_port field has been split up and
re-declared from u32 to be16.
However, the accompanying changes to the context access converter have not
been well thought through when it comes big-endian platforms.
Today 2-byte wide loads from offsetof(struct bpf_sk_lookup, remote_port)
are handled as narrow loads from a 4-byte wide field.
This by itself is not enough to create a problem, but when we combine
1. 32-bit wide access to ->remote_port backed by a 16-wide wide load, with
2. inherent difference between litte- and big-endian in how narrow loads
need have to be handled (see bpf_ctx_narrow_access_offset),
we get inconsistent results for a 2-byte loads from &ctx->remote_port on LE
and BE architectures. This in turn makes BPF C code for the common case of
2-byte load from ctx->remote_port not portable.
To rectify it, inform the context access converter that remote_port is
2-byte wide field, and only 1-byte loads need to be treated as narrow
loads.
At the same time, we special-case the 4-byte load from &ctx->remote_port to
continue handling it the same way as do today, in order to keep the
existing BPF programs working.
Fixes: 9a69e2b385f4 ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-2-jakub@cloudflare.com
|
|
Joanne Koong says:
====================
From: Joanne Koong <joannelkoong@gmail.com>
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patchset, sleepable programs can allocate memory in local
storage using GFP_KERNEL, while non-sleepable programs always default to
GFP_ATOMIC.
v3 <- v2:
* Add extra case to local_storage.c selftest to test associating multiple
elements with the local storage, which triggers a GFP_KERNEL allocation in
local_storage_update().
* Cast gfp_t to __s32 in verifier to fix the sparse warnings
v2 <- v1:
* Allocate the memory before/after the raw_spin_lock_irqsave, depending
on the gfp flags
* Rename mem_flags to gfp_flags
* Reword the comment "*mem_flags* is set by the bpf verifier" to
"*gfp_flags* is a hidden argument provided by the verifier"
* Add a sentence to the commit message about existing local storage
selftests covering both the GFP_ATOMIC and GFP_KERNEL paths in
bpf_local_storage_update.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds a few calls to the existing local storage selftest to
test that we can associate multiple elements with the local storage.
The sleepable program's call to bpf_sk_storage_get with sk_storage_map2
will lead to an allocation of a new selem under the GFP_KERNEL flag.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220318045553.3091807-3-joannekoong@fb.com
|
|
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patch, the verifier detects whether the program is sleepable,
and passes the corresponding GFP_KERNEL or GFP_ATOMIC flag as a
5th argument to bpf_task/sk/inode_storage_get. This flag will propagate
down to the local storage functions that allocate memory.
Please note that bpf_task/sk/inode_storage_update_elem functions are
invoked by userspace applications through syscalls. Preemption is
disabled before bpf_task/sk/inode_storage_update_elem is called, which
means they will always have to allocate memory atomically.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220318045553.3091807-2-joannekoong@fb.com
|
|
If BPF object doesn't have an BTF info, don't attempt to search for BTF
types describing BPF map key or value layout.
Fixes: 262cfb74ffda ("libbpf: Init btf_{key,value}_type_id on internal map open")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220320001911.3640917-1-andrii@kernel.org
|
|
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The commit cad6fade6e78 ("xtensa: clean up WSR*/RSR*/get_sr/set_sr")
replaced 'WSR' macro in the function xtensa_wsr with 'xtensa_set_sr',
but variable 'v' in the xtensa_set_sr body shadowed the argument 'v'
passed to it, resulting in wrong value written to debug registers.
Fix that by removing intermediate variable from the xtensa_set_sr
macro body.
Cc: stable@vger.kernel.org
Fixes: cad6fade6e78 ("xtensa: clean up WSR*/RSR*/get_sr/set_sr")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
|
|
While the original code is valid, it is not the obvious choice for the
sizeof() call and in preparation to limit the scope of the list iterator
variable the sizeof should be changed to the size of the destination.
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Pull kvm fix from Paolo Bonzini:
"Fix for the SLS mitigation, which makes a 'SETcc/RET' pair grow
to 'SETcc/RET/INT3'.
This doesn't fit in 4 bytes any more, so the alignment has to
change to 8 for this case"
* tag 'for-linus-5.17' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm/emulate: Fix SETcc emulation function offsets with SLS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
"Two driver fixes:
- a fix for zinitix touchscreen to properly report contacts
- a fix for aiptek tablet driver to be more resilient to devices with
incorrect descriptors"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: aiptek - properly check endpoint type
Input: zinitix - do not report shadow fingers
|
|
I've been chasing a recent resurgence in generic/388 recovery
failure and/or corruption events. The events have largely been
uninitialised inode chunks being tripped over in log recovery
such as:
XFS (pmem1): User initiated shutdown received.
pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem.
XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
XFS (pmem1): Unmounting Filesystem
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Starting recovery (logdev: internal)
XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): First 128 bytes of corrupted metadata buffer:
00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
XFS (pmem1): log mount/recovery failed: error -117
XFS (pmem1): log mount failed
There have been isolated random other issues, too - xfs_repair fails
because it finds some corruption in symlink blocks, rmap
inconsistencies, etc - but they are nowhere near as common as the
uninitialised inode chunk failure.
The problem has clearly happened at runtime before recovery has run;
I can see the ICREATE log item in the log shortly before the
actively recovered range of the log. This means the ICREATE was
definitely created and written to the log, but for some reason the
tail of the log has been moved past the ordered buffer log item that
tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
log moving past the ICREATE log item before the inode chunk buffer
is written to disk.
Tracing the fsstress processes that are running when the filesystem
shut down immediately pin-pointed the problem:
user shutdown marks xfs_mount as shutdown
godown-213341 [008] 6398.022871: console: [ 6397.915392] XFS (pmem1): User initiated shutdown received.
.....
aild tries to push ordered inode cluster buffer
xfsaild/pmem1-213314 [001] 6398.022974: xfs_buf_trylock: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
xfsaild/pmem1-213314 [001] 6398.022976: xfs_ilock_nowait: dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
calls xfs_iflush_abort() to kill writeback of the inode.
Inode is removed from AIL, drops cluster buffer reference.
xfsaild/pmem1-213314 [001] 6398.022977: xfs_ail_delete: dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
xfsaild/pmem1-213314 [001] 6398.022978: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
.....
All inodes on cluster buffer are aborted, then the cluster buffer
itself is aborted and removed from the AIL *without writeback*:
xfsaild/pmem1-213314 [001] 6398.023011: xfs_buf_error_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
xfsaild/pmem1-213314 [001] 6398.023012: xfs_ail_delete: dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
The inode buffer was at 7/20344 when it was removed from the AIL.
xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_item_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
.....
Userspace is still running, doing stuff. an fsstress process runs
syncfs() or sync() and we end up in sync_fs_one_sb() which issues
a log force. This pushes on the CIL:
fsstress-213322 [001] 6398.024430: xfs_fs_sync_fs: dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
<...>-194402 [001] 6398.024467: kmem_alloc: size 176 flags 0x14 caller xlog_cil_push_work+0x9f
And the CIL fills up iclogs with pending changes. This picks up
the current tail from the AIL:
<...>-194402 [001] 6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags caller xlog_write+0x149
<...>-194402 [001] 6398.024498: xlog_iclog_switch: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags caller xlog_state_get_iclog_space+0x37e
<...>-194402 [001] 6398.024521: xlog_iclog_release: dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags caller xlog_write+0x5f9
<...>-194402 [001] 6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
And it moves the tail of the log to 7/21000 from 7/20344. This
*moves the tail of the log beyond the ICREATE transaction* that was
at 7/20344 and pinned by the inode cluster buffer that was cancelled
above.
....
godown-213341 [008] 6398.027005: xfs_force_shutdown: dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
godown-213341 [008] 6398.027022: console: [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
godown-213341 [008] 6398.030551: console: [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
And finally the log itself is now shutdown, stopping all further
writes to the log. But this is too late to prevent the corruption
that moving the tail of the log forwards after we start cancelling
writeback causes.
The fundamental problem here is that we are using the wrong shutdown
checks for log items. We've long conflated mount shutdown with log
shutdown state, and I started separating that recently with the
atomic shutdown state changes in commit b36d4651e165 ("xfs: make
forced shutdown processing atomic"). The changes in that commit
series are directly responsible for being able to diagnose this
issue because it clearly separated mount shutdown from log shutdown.
Essentially, once we start cancelling writeback of log items and
removing them from the AIL because the filesystem is shut down, we
*cannot* update the journal because we may have cancelled the items
that pin the tail of the log. That moves the tail of the log
forwards without having written the metadata back, hence we have
corrupt in memory state and writing to the journal propagates that
to the on-disk state.
What commit b36d4651e165 makes clear is that log item state needs to
change relative to log shutdown, not mount shutdown. IOWs, anything
that aborts metadata writeback needs to check log shutdown state
because log items directly affect log consistency. Having them check
mount shutdown state introduces the above race condition where we
cancel metadata writeback before the log shuts down.
To fix this, this patch works through all log items and converts
shutdown checks to use xlog_is_shutdown() rather than
xfs_is_shutdown(), so that we don't start aborting metadata
writeback before we shut off journal writes.
AFAICT, this race condition is a zero day IO error handling bug in
XFS that dates back to the introduction of XLOG_IO_ERROR,
XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The AIL operates purely on log items, so it is a log centric
subsystem. Divorce it from the xfs_mount and instead have it pass
around xlog pointers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Log items belong to the log, not the xfs_mount. Convert the mount
pointer in the log item to a xlog pointer in preparation for
upcoming log centric changes to the log items.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When the AIL tries to flush the CIL, it relies on the CIL push
ending up on stable storage without having to wait for and
manipulate iclog state directly. However, if there is already a
pending CIL push when the AIL tries to flush the CIL, it won't set
the cil->xc_push_commit_stable flag and so the CIL push will not
actively flush the commit record iclog.
generic/530 when run on a single CPU test VM can trigger this fairly
reliably. This test exercises unlinked inode recovery, and can
result in inodes being pinned in memory by ongoing modifications to
the inode cluster buffer to record unlinked list modifications. As a
result, the first inode unlinked in a buffer can pin the tail of the
log whilst the inode cluster buffer is pinned by the current
checkpoint that has been pushed but isn't on stable storage because
because the cil->xc_push_commit_stable was not set. This results in
the log/AIL effectively deadlocking until something triggers the
commit record iclog to be pushed to stable storage (i.e. the
periodic log worker calling xfs_log_force()).
The fix is two-fold - first we should always set the
cil->xc_push_commit_stable when xlog_cil_flush() is called,
regardless of whether there is already a pending push or not.
Second, if the CIL is empty, we should trigger an iclog flush to
ensure that the iclogs of the last checkpoint have actually been
submitted to disk as that checkpoint may not have been run under
stable completion constraints.
Reported-and-tested-by: Matthew Wilcox <willy@infradead.org>
Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
xfs_ail_push_all_sync() has a loop like this:
while max_ail_lsn {
prepare_to_wait(ail_empty)
target = max_ail_lsn
wake_up(ail_task);
schedule()
}
Which is designed to sleep until the AIL is emptied. When
xfs_ail_update_finish() moves the tail of the log, it does:
if (list_empty(&ailp->ail_head))
wake_up_all(&ailp->ail_empty);
So it will only wake up the sync push waiter when the AIL goes
empty. If, by the time the push waiter has woken, the AIL has more
in it, it will reset the target, wake the push task and go back to
sleep.
The problem here is that if the AIL is having items added to it
when xfs_ail_push_all_sync() is called, then they may get inserted
into the AIL at a LSN higher than the target LSN. At this point,
xfsaild_push() will see that the target is X, the item LSNs are
(X+N) and skip over them, hence never pushing the out.
The result of this the AIL will not get emptied by the AIL push
thread, hence xfs_ail_finish_update() will never see the AIL being
empty even if it moves the tail. Hence xfs_ail_push_all_sync() never
gets woken and hence cannot update the push target to capture the
items beyond the current target on the LSN.
This is a TOCTOU type of issue so the way to avoid it is to not
use the push target at all for sync pushes. We know that a sync push
is being requested by the fact the ail_empty wait queue is active,
hence the xfsaild can just set the target to max_ail_lsn on every
push that we see the wait queue active. Hence we no longer will
leave items on the AIL that are beyond the LSN sampled at the start
of a sync push.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|