|
Daniel T. Lee says:
====================
samples/bpf: make BPF programs more libbpf aware
The existing tracing programs have been developed for a considerable
period of time and, as a result, do not properly incorporate the
features of the current libbpf, such as CO-RE. This is evident in
frequent usage of functions like PT_REGS* and the persistence of "hack"
methods using underscore-style bpf_probe_read_kernel from the past.
These programs are far behind the current level of libbpf and can
potentially confuse users.
The kernel has undergone significant changes, some of which have broken
these programs; on the other hand, more robust APIs have been developed
for increased stability.
To list some of the kernel changes that this patch set is focusing on,
- symbol mismatch occurs due to compiler optimization [1]
- inline of blk_account_io* breaks BPF kprobe program [2]
- new tracepoints for the block_io_start/done are introduced [3]
- map lookup probes can't be triggered (bpf_disable_instrumentation)[4]
- BPF_KSYSCALL has been introduced to simplify argument fetching [5]
- convert to vmlinux.h and use the tracepoint argument structures it provides
- make tracing programs more CO-RE centric
In this regard, this patch set aims not only to integrate the latest
features of libbpf into the BPF programs but also to clarify them,
reducing potential confusion among users and making the programs more
intuitive.
[1]: https://github.com/iovisor/bcc/issues/1754
[2]: https://github.com/iovisor/bcc/issues/4261
[3]: commit 5a80bd075f3b ("block: introduce block_io_start/block_io_done tracepoints")
[4]: commit 7c4cd051add3 ("bpf: Fix syscall's stackmap lookup potential deadlock")
[5]: commit 6f5d467d55f0 ("libbpf: improve BPF_KPROBE_SYSCALL macro and rename it to BPF_KSYSCALL")
====================
Link: https://lore.kernel.org/r/20230818090119.477441-1-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
With the introduction of kprobe.multi, it is now possible to attach
multiple kprobes to a single BPF program without the need for multiple
definitions. Additionally, this method supports wildcard-based
matching, allowing for further simplification of BPF programs. Here, an
asterisk (*) wildcard is used to match all symbols relevant to
spin_{lock|unlock}.
Furthermore, since kprobe.multi handles symbol matching, this commit
eliminates the need for the previous logic of reading the ksym table to
verify the existence of symbols.
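A minimal sketch of the resulting program shape (the glob pattern and
program name here are illustrative, not the exact patch; the section
prefix is libbpf's kprobe.multi convention):
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* one definition, attached to every symbol the glob matches */
SEC("kprobe.multi/*spin_*lock*")
int BPF_KPROBE(trace_spin_lock)
{
	/* invoked for each matched spin_{lock|unlock} variant */
	return 0;
}

char _license[] SEC("license") = "GPL";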
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-10-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This commit refactors the syscall tracing programs by adopting the
BPF_KSYSCALL macro. This change aims to enhance the clarity and
simplicity of the BPF programs by reducing the complexity of argument
parsing from pt_regs.
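For illustration, this is roughly what a probe looks like with the
macro (hypothetical program name; the macro decodes the syscall
arguments from pt_regs, so no manual parsing is needed):
SEC("ksyscall/write")
int BPF_KSYSCALL(trace_write, unsigned int fd, const char *buf, size_t count)
{
	bpf_printk("write fd=%u count=%lu", fd, (unsigned long)count);
	return 0;
}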
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-9-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Commit 7c4cd051add3 ("bpf: Fix syscall's stackmap lookup potential
deadlock") addressed a potential deadlock; as a side effect,
*_map_lookup_elem no longer triggers BPF programs, because
bpf_disable_instrumentation() is called prior to the lookup.
To restore the broken map lookup probe on "htab_map_lookup_elem", this
commit introduces an alternative approach: it instead probes
"bpf_map_copy_value" and applies a filter on the map_type, specifically
for the hash table.
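A minimal sketch of that filter (hypothetical program name, assuming
the usual vmlinux.h and bpf_core_read.h includes):
SEC("kprobe/bpf_map_copy_value")
int BPF_KPROBE(trace_map_lookup, struct bpf_map *map)
{
	/* only report lookups on hash maps, mirroring the old probe */
	if (BPF_CORE_READ(map, map_type) != BPF_MAP_TYPE_HASH)
		return 0;

	/* ... record the lookup ... */
	return 0;
}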
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Fixes: 7c4cd051add3 ("bpf: Fix syscall's stackmap lookup potential deadlock")
Link: https://lore.kernel.org/r/20230818090119.477441-8-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Recently, new tracepoints for the block layer, the block_io_start/done
tracepoints, were introduced in commit 5a80bd075f3b
("block: introduce block_io_start/block_io_done tracepoints").
Previously, the kprobe entry used for this purpose was quite unstable
and inherently broke the relevant probes [1]. Now that stable
tracepoints are available, this commit replaces the bio latency check
with them.
One of the changes made during this replacement is the key used for the
hash table. Since 'struct request' cannot be used as a hash key, the
approach taken follows that which was implemented in bcc/biolatency [2].
(uses dev:sector for the key)
[1]: https://github.com/iovisor/bcc/issues/4261
[2]: https://github.com/iovisor/bcc/pull/4691
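A rough sketch of the tracepoint side (map and struct names are
hypothetical; the context layout is assumed to follow vmlinux.h's
trace_event_raw_block_rq, and dev is widened to u64 to avoid padding
in the hash key):
struct io_key {
	u64 dev;
	u64 sector;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, struct io_key);
	__type(value, u64);
} start SEC(".maps");

SEC("tracepoint/block/block_io_start")
int trace_block_io_start(struct trace_event_raw_block_rq *ctx)
{
	struct io_key key = { .dev = ctx->dev, .sector = ctx->sector };
	u64 ts = bpf_ktime_get_ns();

	bpf_map_update_elem(&start, &key, &ts, BPF_ANY);
	return 0;
}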
Fixes: 450b7879e345 ("block: move blk_account_io_{start,done} to blk-mq.c")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-7-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The existing tracing programs have been developed for a considerable
period of time and, as a result, do not properly incorporate the
features of the current libbpf, such as CO-RE. This is evident in
frequent usage of functions like PT_REGS* and the persistence of "hack"
methods using underscore-style bpf_probe_read_kernel from the past.
These programs are far behind the current level of libbpf and can
potentially confuse users. Therefore, this commit aims to convert the
outdated BPF programs to be more CO-RE centric.
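As a before/after sketch of what such a conversion looks like (the
field access is illustrative):
/* before: the old underscore macro hack around bpf_probe_read_kernel() */
int pid = _(task->pid);

/* after: a CO-RE relocatable read that survives struct layout changes */
int pid = BPF_CORE_READ(task, pid);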
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-6-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, multiple kprobe programs are suffering from symbol mismatch
due to compiler optimization. These optimizations can append an
additional suffix such as '.isra' or '.constprop' to the symbol name.
# egrep ' finish_task_switch| __netif_receive_skb_core' /proc/kallsyms
ffffffff81135e50 t finish_task_switch.isra.0
ffffffff81dd36d0 t __netif_receive_skb_core.constprop.0
ffffffff8205cc0e t finish_task_switch.isra.0.cold
ffffffff820b1aba t __netif_receive_skb_core.constprop.0.cold
To avoid this, this commit replaces the original kprobe section with
kprobe.multi, which can match symbols using wildcard characters. Here,
an asterisk is used to avoid the symbol mismatch.
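For example, a section of the following shape (sketch, hypothetical
program name) matches finish_task_switch along with its
.isra/.constprop variants:
SEC("kprobe.multi/finish_task_switch*")
int BPF_KPROBE(prog_finish_task_switch)
{
	return 0;
}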
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-5-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, BPF programs typically have a suffix of .bpf.c. However,
some programs still use the old _kern.c suffix alongside this naming
convention. To achieve consistency, this commit unifies the naming
convention of the BPF kernel programs.
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-4-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This commit replaces the separate headers with a single vmlinux.h in
the tracing programs. Thanks to that, we no longer need to define the
argument structures for tracing programs by hand. For example, the
argument structure for the sched_switch tracepoint (sched_switch_args)
can be replaced with the trace_event_raw_sched_switch provided by
vmlinux.h.
Additional defines have been added to the BPF programs, either directly
or through the inclusion of net_shared.h: the PERF_MAX_STACK_DEPTH and
IFNAMSIZ constants and the __stringify() macro. This change enables the
BPF programs to access internal structures with the BTF-generated
"vmlinux.h" header.
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-3-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, compiling the BPF programs results in the following warning
about an ignored attribute. This commit fixes the warning by adding the
-fcf-protection compiler option.
In file included from ./arch/x86/include/asm/linkage.h:6:
./arch/x86/include/asm/ibt.h:77:8: warning: 'nocf_check' attribute ignored; use -fcf-protection to enable the attribute [-Wignored-attributes]
extern __noendbr u64 ibt_save(bool disable);
^
./arch/x86/include/asm/ibt.h:32:34: note: expanded from macro '__noendbr'
^
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-2-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Teach lockdep that icc_bw_lock is needed in code paths that could
deadlock if they trigger reclaim.
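The priming pattern is roughly the following, under the assumption that
it runs once at init time:
/* record the ordering up front so violations are caught immediately */
fs_reclaim_acquire(GFP_KERNEL);
mutex_lock(&icc_bw_lock);
mutex_unlock(&icc_bw_lock);
fs_reclaim_release(GFP_KERNEL);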
Signed-off-by: Rob Clark <robdclark@chromium.org>
Link: https://lore.kernel.org/r/20230807171148.210181-8-robdclark@gmail.com
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
For cases where icc_set_bw() can be called in call paths that could
deadlock against shrinker/reclaim, such as runpm resume, we need to
decouple the icc locking. Introduce a new icc_bw_lock for cases where
we need to serialize bw aggregation and update, to decouple that from
paths that require memory allocation such as node/link creation and
destruction.
Fixes this lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.2.0-rc8-debug+ #554 Not tainted
------------------------------------------------------
ring0/132 is trying to acquire lock:
ffffff80871916d0 (&gmu->lock){+.+.}-{3:3}, at: a6xx_pm_resume+0xf0/0x234
but task is already holding lock:
ffffffdb5aee57e8 (dma_fence_map){++++}-{0:0}, at: msm_job_run+0x68/0x150
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (dma_fence_map){++++}-{0:0}:
__dma_fence_might_wait+0x74/0xc0
dma_resv_lockdep+0x1f4/0x2f4
do_one_initcall+0x104/0x2bc
kernel_init_freeable+0x344/0x34c
kernel_init+0x30/0x134
ret_from_fork+0x10/0x20
-> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
fs_reclaim_acquire+0x80/0xa8
slab_pre_alloc_hook.constprop.0+0x40/0x25c
__kmem_cache_alloc_node+0x60/0x1cc
__kmalloc+0xd8/0x100
topology_parse_cpu_capacity+0x8c/0x178
get_cpu_for_node+0x88/0xc4
parse_cluster+0x1b0/0x28c
parse_cluster+0x8c/0x28c
init_cpu_topology+0x168/0x188
smp_prepare_cpus+0x24/0xf8
kernel_init_freeable+0x18c/0x34c
kernel_init+0x30/0x134
ret_from_fork+0x10/0x20
-> #2 (fs_reclaim){+.+.}-{0:0}:
__fs_reclaim_acquire+0x3c/0x48
fs_reclaim_acquire+0x54/0xa8
slab_pre_alloc_hook.constprop.0+0x40/0x25c
__kmem_cache_alloc_node+0x60/0x1cc
__kmalloc+0xd8/0x100
kzalloc.constprop.0+0x14/0x20
icc_node_create_nolock+0x4c/0xc4
icc_node_create+0x38/0x58
qcom_icc_rpmh_probe+0x1b8/0x248
platform_probe+0x70/0xc4
really_probe+0x158/0x290
__driver_probe_device+0xc8/0xe0
driver_probe_device+0x44/0x100
__driver_attach+0xf8/0x108
bus_for_each_dev+0x78/0xc4
driver_attach+0x2c/0x38
bus_add_driver+0xd0/0x1d8
driver_register+0xbc/0xf8
__platform_driver_register+0x30/0x3c
qnoc_driver_init+0x24/0x30
do_one_initcall+0x104/0x2bc
kernel_init_freeable+0x344/0x34c
kernel_init+0x30/0x134
ret_from_fork+0x10/0x20
-> #1 (icc_lock){+.+.}-{3:3}:
__mutex_lock+0xcc/0x3c8
mutex_lock_nested+0x30/0x44
icc_set_bw+0x88/0x2b4
_set_opp_bw+0x8c/0xd8
_set_opp+0x19c/0x300
dev_pm_opp_set_opp+0x84/0x94
a6xx_gmu_resume+0x18c/0x804
a6xx_pm_resume+0xf8/0x234
adreno_runtime_resume+0x2c/0x38
pm_generic_runtime_resume+0x30/0x44
__rpm_callback+0x15c/0x174
rpm_callback+0x78/0x7c
rpm_resume+0x318/0x524
__pm_runtime_resume+0x78/0xbc
adreno_load_gpu+0xc4/0x17c
msm_open+0x50/0x120
drm_file_alloc+0x17c/0x228
drm_open_helper+0x74/0x118
drm_open+0xa0/0x144
drm_stub_open+0xd4/0xe4
chrdev_open+0x1b8/0x1e4
do_dentry_open+0x2f8/0x38c
vfs_open+0x34/0x40
path_openat+0x64c/0x7b4
do_filp_open+0x54/0xc4
do_sys_openat2+0x9c/0x100
do_sys_open+0x50/0x7c
__arm64_sys_openat+0x28/0x34
invoke_syscall+0x8c/0x128
el0_svc_common.constprop.0+0xa0/0x11c
do_el0_svc+0xac/0xbc
el0_svc+0x48/0xa0
el0t_64_sync_handler+0xac/0x13c
el0t_64_sync+0x190/0x194
-> #0 (&gmu->lock){+.+.}-{3:3}:
__lock_acquire+0xe00/0x1060
lock_acquire+0x1e0/0x2f8
__mutex_lock+0xcc/0x3c8
mutex_lock_nested+0x30/0x44
a6xx_pm_resume+0xf0/0x234
adreno_runtime_resume+0x2c/0x38
pm_generic_runtime_resume+0x30/0x44
__rpm_callback+0x15c/0x174
rpm_callback+0x78/0x7c
rpm_resume+0x318/0x524
__pm_runtime_resume+0x78/0xbc
pm_runtime_get_sync.isra.0+0x14/0x20
msm_gpu_submit+0x58/0x178
msm_job_run+0x78/0x150
drm_sched_main+0x290/0x370
kthread+0xf0/0x100
ret_from_fork+0x10/0x20
other info that might help us debug this:
Chain exists of:
&gmu->lock --> mmu_notifier_invalidate_range_start --> dma_fence_map
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(dma_fence_map);
lock(mmu_notifier_invalidate_range_start);
lock(dma_fence_map);
lock(&gmu->lock);
*** DEADLOCK ***
2 locks held by ring0/132:
#0: ffffff8087191170 (&gpu->lock){+.+.}-{3:3}, at: msm_job_run+0x64/0x150
#1: ffffffdb5aee57e8 (dma_fence_map){++++}-{0:0}, at: msm_job_run+0x68/0x150
stack backtrace:
CPU: 7 PID: 132 Comm: ring0 Not tainted 6.2.0-rc8-debug+ #554
Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
Call trace:
dump_backtrace.part.0+0xb4/0xf8
show_stack+0x20/0x38
dump_stack_lvl+0x9c/0xd0
dump_stack+0x18/0x34
print_circular_bug+0x1b4/0x1f0
check_noncircular+0x78/0xac
__lock_acquire+0xe00/0x1060
lock_acquire+0x1e0/0x2f8
__mutex_lock+0xcc/0x3c8
mutex_lock_nested+0x30/0x44
a6xx_pm_resume+0xf0/0x234
adreno_runtime_resume+0x2c/0x38
pm_generic_runtime_resume+0x30/0x44
__rpm_callback+0x15c/0x174
rpm_callback+0x78/0x7c
rpm_resume+0x318/0x524
__pm_runtime_resume+0x78/0xbc
pm_runtime_get_sync.isra.0+0x14/0x20
msm_gpu_submit+0x58/0x178
msm_job_run+0x78/0x150
drm_sched_main+0x290/0x370
kthread+0xf0/0x100
ret_from_fork+0x10/0x20
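A condensed sketch of the decoupling (simplified; the aggregation
helper is hypothetical):
static DEFINE_MUTEX(icc_lock);    /* topology: node/link create/destroy */
static DEFINE_MUTEX(icc_bw_lock); /* bw aggregation; no allocations under it */

int icc_set_bw(struct icc_path *path, u32 avg_bw, u32 peak_bw)
{
	int ret;

	mutex_lock(&icc_bw_lock);
	/* aggregate requests and apply the votes; must not allocate */
	ret = apply_aggregated_bw(path, avg_bw, peak_bw); /* hypothetical */
	mutex_unlock(&icc_bw_lock);

	return ret;
}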
Signed-off-by: Rob Clark <robdclark@chromium.org>
Link: https://lore.kernel.org/r/20230807171148.210181-7-robdclark@gmail.com
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
Hou Tao says:
====================
Remove unnecessary synchronizations in cpumap
From: Hou Tao <houtao1@huawei.com>
Hi,
This is the formal patchset to remove unnecessary synchronizations in
cpu-map after address comments and collect Rvb tags from Toke
Høiland-Jørgensen (Big thanks to Toke). Patch #1 removes the unnecessary
rcu_barrier() when freeing bpf_cpu_map_entry and replaces it by
queue_rcu_work(). Patch #2 removes the unnecessary call_rcu() and
queue_work() when destroying cpu-map and does the freeing directly.
The patchset was tested using xdp_redirect_cpu and virtio-net. Both
xdp-mode and skb-mode have been exercised and no issues were reported.
As usual, comments and suggestions are always welcome.
Change Log:
v1:
* address comments from Toke Høiland-Jørgensen
* add Rvb tags from Toke Høiland-Jørgensen
* update outdated comment in cpu_map_delete_elem()
RFC: https://lore.kernel.org/bpf/20230728023030.1906124-1-houtao@huaweicloud.com
====================
Link: https://lore.kernel.org/r/20230816045959.358059-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
After synchronize_rcu(), both the detached XDP program and
xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry
will be cpu_map_kthread_run(). So instead of calling
__cpu_map_entry_replace() to stop the kthread and clean up the entry
after an RCU grace period, do these things directly.
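Roughly, the teardown becomes (sketch with simplified fields; the real
patch also frees the per-entry queues and map memory):
static void cpu_map_free(struct bpf_map *map)
{
	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
	u32 i;

	/* no XDP program can reach the entries past this point and all
	 * xdp_do_flush() callers are done, so tear down synchronously
	 */
	synchronize_rcu();

	for (i = 0; i < cmap->map.max_entries; i++) {
		struct bpf_cpu_map_entry *rcpu;

		rcpu = rcu_dereference_raw(cmap->cpu_map[i]);
		if (!rcpu)
			continue;

		/* stopping the kthread flushes its pending frames/skbs */
		kthread_stop(rcpu->kthread);
	}
}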
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, __cpu_map_entry_replace() uses call_rcu() to wait for the
inflight xdp program to exit the RCU read-side critical section, and
then launches the kworker cpu_map_kthread_stop() to call kthread_stop()
and flush all pending xdp frames or skbs.
But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to
wait for the completion of __cpu_map_entry_free(), because rcu_barrier()
will wait for all pending RCU callbacks while cpu_map_kthread_stop()
only needs to wait for the completion of a specific
__cpu_map_entry_free().
So use queue_rcu_work() to replace call_rcu(), schedule_work() and
rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free()
kworker after an RCU grace period. Because __cpu_map_entry_free() runs
in a kworker context, it is OK to do all of the freeing procedures,
including kthread_stop(), in it.
After the update, there is no need to do reference-counting for
bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in
__cpu_map_entry_free(), so just remove it.
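A sketch of the shape of the change, assuming bpf_cpu_map_entry gains a
struct rcu_work free_work member:
static void __cpu_map_entry_free(struct work_struct *work)
{
	struct bpf_cpu_map_entry *rcpu;

	rcpu = container_of(to_rcu_work(work), struct bpf_cpu_map_entry,
			    free_work);
	/* a grace period has elapsed: no inflight XDP users remain */
	kthread_stop(rcpu->kthread);
	/* ... free the queues and the entry itself ... */
	kfree(rcpu);
}

/* replaces call_rcu() + schedule_work() + rcu_barrier() */
INIT_RCU_WORK(&rcpu->free_work, __cpu_map_entry_free);
queue_rcu_work(system_wq, &rcpu->free_work);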
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Enable sync_state on sm8450 so that the interconnect votes actually mean
anything and aren't just pinned to INT_MAX.
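The fix boils down to wiring the provider driver to the generic
interconnect sync_state handler, roughly (surrounding fields sketched):
static struct platform_driver qnoc_driver = {
	.probe = qnoc_probe,
	.remove = qnoc_remove,
	.driver = {
		.name = "qnoc-sm8450",
		.of_match_table = qnoc_of_match,
		.sync_state = icc_sync_state, /* release the INT_MAX floor */
	},
};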
Fixes: fafc114a468e ("interconnect: qcom: Add SM8450 interconnect provider driver")
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Link: https://lore.kernel.org/r/20230811-topic-8450_syncstate-v1-1-69ae5552a18b@linaro.org
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct icc_onecell_data.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
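The resulting annotation looks like this:
struct icc_onecell_data {
	unsigned int num_nodes;
	struct icc_node *nodes[] __counted_by(num_nodes);
};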
Cc: Andy Gross <agross@kernel.org>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Georgi Djakov <djakov@kernel.org>
Cc: linux-arm-msm@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Link: https://lore.kernel.org/r/20230817204215.never.916-kees@kernel.org
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct icc_path.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Georgi Djakov <djakov@kernel.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230817204144.never.605-kees@kernel.org
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct icc_clk_provider.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Georgi Djakov <djakov@kernel.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230817202914.never.661-kees@kernel.org
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
The simple 'opp-table: true' schema accepts a boolean property as an
opp-table, so restrict it to an object to properly enforce real OPP
table nodes.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230820080543.25204-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
All callers of __of_{add,remove,update}_property() and
__of_{attach,detach}_node() wrap the call with the devtree_lock
spinlock. Let's move the spinlock into the functions. This allows moving
the sysfs update functions into those functions as well.
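Schematically, one of the helpers ends up shaped like this (the
unlocked list-update helper name is hypothetical):
int __of_add_property(struct device_node *np, struct property *prop)
{
	unsigned long flags;
	int rc;

	raw_spin_lock_irqsave(&devtree_lock, flags);
	rc = __of_add_property_locked(np, prop); /* hypothetical: list update */
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	if (!rc)
		__of_add_property_sysfs(np, prop); /* sysfs update moves in too */

	return rc;
}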
Link: https://lore.kernel.org/r/20230801-dt-changeset-fixes-v3-6-5f0410e007dd@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
The changeset code checks for a property in the deadprops list when
adding/updating a property, but of_add_property() and
of_update_property() do not. As the users of these functions are pretty
simple, they have not hit this scenario or else the property lists
would get corrupted.
With this change, there are three cases of removing a property from
either the deadprops or the properties list, so add a helper to find
and remove a matching property.
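A sketch of such a helper (hypothetical name), unlinking prop from a
singly linked property list when present:
static struct property *__of_prop_list_remove(struct property **list,
					      struct property *prop)
{
	struct property **next;

	for (next = list; *next; next = &(*next)->next) {
		if (*next == prop) {
			*next = prop->next;
			prop->next = NULL;
			return prop;
		}
	}
	return NULL;
}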
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20230801-dt-changeset-fixes-v3-5-5f0410e007dd@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
__of_update_property() returns the existing property if there is one, but
that value is never added to the changeset. Updates work because the
existing property was also retrieved before in of_changeset_action(),
but that is racy as of_changeset_action() doesn't hold any locks. The
property could be changed before the changeset is applied.
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20230801-dt-changeset-fixes-v3-4-5f0410e007dd@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
Several places print the changeset action with node and property
details. Refactor these into a common printing helper. The complicating
factor is some prints are debug and some are errors. Solve this with a
bit of preprocessor magic.
Some cases printed the 'cset', which was the changeset entry pointer
rather than the whole changeset itself. The changeset entry is not all
that interesting and gets obfuscated by default anyway, so just drop it.
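The "preprocessor magic" amounts to stamping out one helper per
severity from a shared body, along these lines (sketch; the helper and
action_name() are hypothetical):
#define DEFINE_CHANGESET_PRINT(level)                                   \
static void of_changeset_print_##level(struct of_changeset_entry *ce)  \
{                                                                       \
	pr_##level("cset: %-15s %pOF:%s\n", action_name(ce->action),    \
		   ce->np, ce->prop ? ce->prop->name : "");             \
}

DEFINE_CHANGESET_PRINT(debug)
DEFINE_CHANGESET_PRINT(err)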
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20230801-dt-changeset-fixes-v3-3-5f0410e007dd@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
Pick up changeset fixes for further rework.
|
|
Add the compatible for the OSM L3 interconnect used in the Snapdragon
670.
Signed-off-by: Richard Acayan <mailingradian@gmail.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230816230412.76862-7-mailingradian@gmail.com
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
This series contains fixes necessary for icc to behave correctly
on QCM2290.
* icc-qcm2290
interconnect: qcom: qcm2290: Enable keep_alive on all buses
interconnect: qcom: qcm2290: Enable sync state
Link: https://lore.kernel.org/r/20230720-topic-qcm2290_icc-v2-0-a2ceb9d3e713@linaro.org
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
The fixes that got merged into v6.5-rc6 are needed here.
Signed-off-by: Georgi Djakov <djakov@kernel.org>
|
|
The qunipro_g4_sel clear is also needed for new platforms with major
version > 5. Fix the version check to take this into account.
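Conceptually the check changes from an equality test to a range test,
along the lines of the following sketch (helper and register names as
the driver uses them today):
/* before: only major version 5 cleared the bit */
if (host->hw_ver.major == 0x5)
	ufshcd_rmwl(host->hba, QUNIPRO_G4_SEL, 0, REG_UFS_CFG0);

/* after: any major version >= 5 clears it */
if (host->hw_ver.major >= 0x5)
	ufshcd_rmwl(host->hba, QUNIPRO_G4_SEL, 0, REG_UFS_CFG0);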
Fixes: 9c02aa24bf40 ("scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW version major 5")
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Nitin Rawat <quic_nitirawa@quicinc.com>
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://lore.kernel.org/r/20230821-topic-sm8x50-upstream-ufs-major-5-plus-v2-1-f42a4b712e58@linaro.org
Reviewed-by: "Bao D. Nguyen" <quic_nguyenb@quicinc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Replaces five calls to compound_head with one.
Link: https://lkml.kernel.org/r/20230816151201.3655946-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Because THP_SWAP uses page->private for each page, we must not use the
space which overlaps that field for anything which would conflict with
that. We avoid the conflict on 32-bit systems by disallowing THP_SWAP on
32-bit.
Link: https://lkml.kernel.org/r/20230816151201.3655946-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This function is misleading; people think it means "Is this a THP", when
all it actually does is check whether this is a large folio. Remove it;
the one remaining user should have been checking to see whether the folio
is PMD sized or not.
Link: https://lkml.kernel.org/r/20230816151201.3655946-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Store the folio order in the low byte of the flags word in the first tail
page. This frees up the word that was being used to store the order and
dtor bytes previously.
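Reading the order then becomes a mask of the first tail page's flags,
approximately:
static inline unsigned int folio_order(struct folio *folio)
{
	if (!folio_test_large(folio))
		return 0;
	return folio->_flags_1 & 0xff; /* low byte of page[1].flags */
}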
Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move PG_writeback into bottom byte so that it can use PG_waiters in a
later patch. Move PG_head into bottom byte as well to match with where
'order' is moving next. PG_active and PG_workingset move into the second
byte to make room for them.
By putting PG_head in bit 6, we ensure that it is cleared by assigning the
folio order to the bottom byte of the first tail page (since the order
cannot be larger than 63).
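Schematically, the low bits end up laid out as follows (sketch; the
surrounding flags are elided and positions are approximate):
enum pageflags {
	PG_locked,	/* bit 0 */
	PG_writeback,	/* moved into the bottom byte */
	PG_referenced,
	PG_uptodate,
	PG_dirty,
	PG_lru,
	PG_head,	/* bit 6: cleared by writing the order byte */
	PG_waiters,	/* bit 7 */
	PG_active,	/* second byte */
	PG_workingset,
	/* ... */
};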
Link: https://lkml.kernel.org/r/20230816151201.3655946-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Stored in the first tail page's flags, this flag replaces the destructor.
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.
Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We can use a bit in page[1].flags to indicate that this folio belongs to
hugetlb instead of using a value in page[1].dtors. That lets
folio_test_hugetlb() become an inline function like it should be. We can
also get rid of NULL_COMPOUND_DTOR.
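With a dedicated flag bit in page[1].flags, the test can be an ordinary
inline, roughly:
static inline bool folio_test_hugetlb(struct folio *folio)
{
	return folio_test_large(folio) &&
	       test_bit(PG_hugetlb, folio_flags(folio, 1));
}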
Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The only remaining destructor is free_compound_page(). Inline it into
destroy_large_folio() and remove the array it used to live in.
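After the inlining, the teardown reads approximately as follows (helper
names are this series' and may differ in detail):
void destroy_large_folio(struct folio *folio)
{
	if (folio_test_hugetlb(folio)) {
		free_huge_folio(folio);
		return;
	}

	if (folio_test_large_rmappable(folio))
		folio_undo_large_rmappable(folio);

	mem_cgroup_uncharge(folio);
	free_the_page(&folio->page, folio_order(folio));
}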
Link: https://lkml.kernel.org/r/20230816151201.3655946-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Match folio_undo_large_rmappable(), and move the casting from page to
folio into the callers (which they were largely doing anyway).
Link: https://lkml.kernel.org/r/20230816151201.3655946-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Indirect calls are expensive, thanks to Spectre. Test for
TRANSHUGE_PAGE_DTOR and destroy the folio appropriately. Move the
free_compound_page() call into destroy_large_folio() to simplify later
patches.
Link: https://lkml.kernel.org/r/20230816151201.3655946-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pass a folio instead of the head page to save a few instructions. Update
the documentation, at least in English.
Link: https://lkml.kernel.org/r/20230816151201.3655946-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Indirect calls are expensive, thanks to Spectre. Call free_huge_page()
directly if the folio belongs to hugetlb.
Link: https://lkml.kernel.org/r/20230816151201.3655946-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Remove _folio_dtor and _folio_order", v2.
This patch (of 13):
folio_put() is the standard way to write this, and it's not appreciably
slower. This is an enabling patch for removing free_compound_page()
entirely.
Link: https://lkml.kernel.org/r/20230816151201.3655946-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230816151201.3655946-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's test whether merging and unmerging in PROT_NONE areas works as
expected.
Pass a page protection to mmap_and_merge_range(), which will trigger
an mprotect() after writing to the pages, but before enabling merging.
Make sure that unsharing works as expected, by performing a ptrace write
(using /proc/self/mem) and by setting MADV_UNMERGEABLE.
Note that this implicitly tests that ptrace writes in an inaccessible
(PROT_NONE) mapping work as expected.
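A sketch of the core of the flow (the extended helper signature and the
merge-mode constant follow this series and are otherwise hypothetical):
/* write the pattern, mprotect(PROT_NONE), then enable merging */
map = mmap_and_merge_range(0xcf, size, PROT_NONE, KSM_MERGE_MADVISE);

/* unsharing must still work on the inaccessible mapping */
if (madvise(map, size, MADV_UNMERGEABLE))
	ksft_test_result_fail("MADV_UNMERGEABLE failed\n");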
[david@redhat.com: use sizeof(i) in test_prot_none(), per Peter]
Link: https://lkml.kernel.org/r/e9cdb144-70c7-6596-2377-e675635c94e0@redhat.com
Link: https://lkml.kernel.org/r/20230803143208.383663-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's extend mmap_and_merge_range() to test if anything in the current
process was merged. range_maps_duplicates() is too unreliable for that
use case, so instead look at KSM stats.
Trigger a complete unmerge first, to cleanup the stable tree and
stabilize accounting of merged pages.
Note that we're using /proc/self/ksm_merging_pages instead of
/proc/self/ksm_stat, because that one is available in more existing
kernels.
If /proc/self/ksm_merging_pages can't be opened, we can't perform any
checks and simply skip them.
We have to special-case the shared zeropage for now. But the only user
-- test_unmerge_zero_pages() -- performs its own merge checks.
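A small self-contained sketch of reading the counter (plain userspace
C; the proc path is the one named above):
#include <errno.h>
#include <stdio.h>

/* pages currently merged for this process, or a negative errno */
static long ksm_get_self_merging_pages(void)
{
	long val;
	FILE *f = fopen("/proc/self/ksm_merging_pages", "r");

	if (!f)
		return -errno;
	if (fscanf(f, "%ld", &val) != 1)
		val = -EINVAL;
	fclose(f);
	return val;
}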
Link: https://lkml.kernel.org/r/20230803143208.383663-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Especially the "For PROT_NONE VMAs, the PTEs are not marked
_PAGE_PROTNONE" part is wrong: doing an mprotect(PROT_NONE) will end up
marking all PTEs on x86_64 as _PAGE_PROTNONE, making pte_protnone()
indicate "yes".
So let's improve the comment, so it's easier to grasp which semantics
pte_protnone() actually has.
Link: https://lkml.kernel.org/r/20230803143208.383663-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from
gup/gup_fast") from 2012 documented as the primary reason why we would want
to handle NUMA hinting faults from GUP:
KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.
That is still the case today, and relevant KVM code has been converted to
manually set FOLL_HONOR_NUMA_FAULT. So let's stop setting
FOLL_HONOR_NUMA_FAULT for all GUP users and cross fingers that not that
many other ones that really require such handling for autonuma remain.
Possible interaction with MMU notifiers:
Assume a driver obtains a page using get_user_pages() to map it into
a secondary MMU, and uses the MMU notifier framework to get notified on
changes.
Assume get_user_pages() succeeded on a PROT_NONE-mapped page (because
FOLL_HONOR_NUMA_FAULT is not set) in an accessible VMA and the page is
mapped into a secondary MMU. Once user space would turn that mapping
inaccessible using mprotect(PROT_NONE), the actual PTE in the page table
might not change. If the MMU notifier would be smart and optimize for that
case "why notify if the PTE didn't change", that could be problematic.
At least change_pmd_range() with MMU_NOTIFY_PROTECTION_VMA for now does an
unconditional mmu_notifier_invalidate_range_start() ->
mmu_notifier_invalidate_range_end() and should be fine.
Note that even if a PTE in an accessible VMA is pte_protnone(), the
underlying page might be accessed by a secondary MMU that does not set
FOLL_HONOR_NUMA_FAULT, and test_young() MMU notifiers would return "true".
Link: https://lkml.kernel.org/r/20230803143208.383663-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
KVM is *the* case we know that really wants to honor NUMA hinting faults.
As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
to map them into a secondary MMU, and add a comment why.
Do that unconditionally in hva_to_pfn_slow() when calling
get_user_pages_unlocked().
kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
gfn_to_page_many_atomic() are similarly used to map pages into a
secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
implicitly honor NUMA hinting faults -- as documented for
FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
for now.
Don't set it in check_user_page_hwpoison(), where we really only want to
check if the mapped page is HW-poisoned.
We won't set it for other KVM users of get_user_pages()/pin_user_pages()
* arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
secondary MMU.
* arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
* arch/s390/kvm/*: s390x only supports a single NUMA node either way
* arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.
This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
implicitly be set by get_user_pages() and friends.
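A condensed sketch of the hva_to_pfn_slow() part of the change (locals
simplified):
unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;
struct page *page;
long npages;

if (write_fault)
	flags |= FOLL_WRITE;
if (async)
	flags |= FOLL_NOWAIT;

npages = get_user_pages_unlocked(addr, 1, &page, flags);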
Link: https://lkml.kernel.org/r/20230803143208.383663-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
The search and wrap-around logic in the ufshcd_mcq_sqe_search()
function does not work correctly when the hwq's queue depth is not a
power of two. Correct it so that any queue depth within the supported
range works.
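The essence of the fix is advancing the search index with modulo
arithmetic rather than a power-of-two mask, roughly (names
approximate):
/* before: correct only when max_entries is a power of two */
slot = (slot + 1) & (hwq->max_entries - 1);

/* after: correct for any supported queue depth */
slot = (slot + 1) % hwq->max_entries;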
Signed-off-by: "Bao D. Nguyen" <quic_nguyenb@quicinc.com>
Link: https://lore.kernel.org/r/ff49c15be205135ed3ec186f3086694c02867dbd.1692149603.git.quic_nguyenb@quicinc.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 8d7290348992 ("scsi: ufs: mcq: Add supporting functions for MCQ abort")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Commit 0f8e5651095b
("of/platform: Propagate firmware node by calling device_set_node()")
used of_fwnode_handle() to replace of_node_get(), which introduced the
side effect that the refcount is no longer increased. The out-of-tree
jailhouse hypervisor enable/disable test then triggers a kernel dump in
of_overlay_remove(), with the following sequence:
"
of_changeset_revert(&overlay_changeset);
of_changeset_destroy(&overlay_changeset);
of_overlay_remove(&overlay_id);
"
So increase the refcount to avoid the issue.
This patch also releases the refcount when the amba device is released,
to avoid a refcount leak.
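The shape of the fix (sketch; the release path drops the reference
taken at creation time):
/* take a reference when propagating the firmware node ... */
device_set_node(&dev->dev, of_fwnode_handle(of_node_get(node)));

/* ... and drop it when the amba device is released */
static void amba_device_release(struct device *d)
{
	struct amba_device *dev = to_amba_device(d);

	of_node_put(d->of_node);
	if (dev->res.parent)
		release_resource(&dev->res);
	kfree(dev);
}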
Fixes: 0f8e5651095b ("of/platform: Propagate firmware node by calling device_set_node()")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Link: https://lore.kernel.org/r/20230821023928.3324283-2-peng.fan@oss.nxp.com
Signed-off-by: Rob Herring <robh@kernel.org>
|