summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-10-27padata: Fix refcnt handling in padata_free_shell()WangJinchao
In a high-load arm64 environment, the pcrypt_aead01 test in LTP can lead to system UAF (Use-After-Free) issues. Due to the lengthy analysis of the pcrypt_aead01 function call, I'll describe the problem scenario using a simplified model: Suppose there's a user of padata named `user_function` that adheres to the padata requirement of calling `padata_free_shell` after `serial()` has been invoked, as demonstrated in the following code: ```c struct request { struct padata_priv padata; struct completion *done; }; void parallel(struct padata_priv *padata) { do_something(); } void serial(struct padata_priv *padata) { struct request *request = container_of(padata, struct request, padata); complete(request->done); } void user_function() { DECLARE_COMPLETION(done) padata->parallel = parallel; padata->serial = serial; padata_do_parallel(); wait_for_completion(&done); padata_free_shell(); } ``` In the corresponding padata.c file, there's the following code: ```c static void padata_serial_worker(struct work_struct *serial_work) { ... cnt = 0; while (!list_empty(&local_list)) { ... padata->serial(padata); cnt++; } local_bh_enable(); if (refcount_sub_and_test(cnt, &pd->refcnt)) padata_free_pd(pd); } ``` Because of the high system load and the accumulation of unexecuted softirq at this moment, `local_bh_enable()` in padata takes longer to execute than usual. Subsequently, when accessing `pd->refcnt`, `pd` has already been released by `padata_free_shell()`, resulting in a UAF issue with `pd->refcnt`. The fix is straightforward: add `refcount_dec_and_test` before calling `padata_free_pd` in `padata_free_shell`. Fixes: 07928d9bfc81 ("padata: Remove broken queue flushing") Signed-off-by: WangJinchao <wangjinchao@xfusion.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-27futex: Don't include process MM in futex key on no-MMUBen Wolsieffer
On no-MMU, all futexes are treated as private because there is no need to map a virtual address to physical to match the futex across processes. This doesn't quite work though, because private futexes include the current process's mm_struct as part of their key. This makes it impossible for one process to wake up a shared futex being waited on in another process. Fix this bug by excluding the mm_struct from the key. With a single address space, the futex address is already a unique key. Fixes: 784bdf3bb694 ("futex: Assume all mappings are private on !MMU systems") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20231019204548.1236437-2-ben.wolsieffer@hefring.com
2023-10-27genirq/generic_chip: Make irq_remove_generic_chip() irqdomain awareHerve Codina
irq_remove_generic_chip() calculates the Linux interrupt number for removing the handler and interrupt chip based on gc::irq_base as a linear function of the bit positions of set bits in the @msk argument. When the generic chip is present in an irq domain, i.e. created with a call to irq_alloc_domain_generic_chips(), gc::irq_base contains not the base Linux interrupt number. It contains the base hardware interrupt for this chip. It is set to 0 for the first chip in the domain, 0 + N for the next chip, where $N is the number of hardware interrupts per chip. That means the Linux interrupt number cannot be calculated based on gc::irq_base for irqdomain based chips without a domain map lookup, which is currently missing. Rework the code to take the irqdomain case into account and calculate the Linux interrupt number by a irqdomain lookup of the domain specific hardware interrupt number. [ tglx: Massage changelog. Reshuffle the logic and add a proper comment. ] Fixes: cfefd21e693d ("genirq: Add chip suspend and resume callbacks") Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20231024150335.322282-1-herve.codina@bootlin.com
2023-10-26Merge tag 'for-netdev' of ↵Jakub Kicinski
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-26 We've added 51 non-merge commits during the last 10 day(s) which contain a total of 75 files changed, 5037 insertions(+), 200 deletions(-). The main changes are: 1) Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF, from Chuyi Zhou. 2) Fix BPF verifier's iterator convergence logic to use exact states comparison for convergence checks, from Eduard Zingerman, Andrii Nakryiko and Alexei Starovoitov. 3) Add BPF programmable net device where bpf_mprog defines the logic of its xmit routine. It can operate in L3 and L2 mode, from Daniel Borkmann and Nikolay Aleksandrov. 4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking for global per-CPU allocator, from Hou Tao. 5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section was going to be present whenever a binary has SHT_GNU_versym section, from Andrii Nakryiko. 6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into atomic_set_release(), from Paul E. McKenney. 7) Add a warning if NAPI callback missed xdp_do_flush() under CONFIG_DEBUG_NET which helps checking if drivers were missing the former, from Sebastian Andrzej Siewior. 8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing a warning under sleepable programs, from Yafang Shao. 9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before checking map_locked, from Song Liu. 10) Make BPF CI linked_list failure test more robust, from Kumar Kartikeya Dwivedi. 11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik. 12) Fix xsk starving when multiple xsk sockets were associated with a single xsk_buff_pool, from Albert Huang. 13) Clarify the signed modulo implementation for the BPF ISA standardization document that it uses truncated division, from Dave Thaler. 14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider signed bounds knowledge, from Andrii Nakryiko. 15) Add an option to XDP selftests to use multi-buffer AF_XDP xdp_hw_metadata and mark used XDP programs as capable to use frags, from Larysa Zaremba. 16) Fix bpftool's BTF dumper wrt printing a pointer value and another one to fix struct_ops dump in an array, from Manu Bretelle. * tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits) netkit: Remove explicit active/peer ptr initialization selftests/bpf: Fix selftests broken by mitigations=off samples/bpf: Allow building with custom bpftool samples/bpf: Fix passing LDFLAGS to libbpf samples/bpf: Allow building with custom CFLAGS/LDFLAGS bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free selftests/bpf: Add selftests for netkit selftests/bpf: Add netlink helper library bpftool: Extend net dump with netkit progs bpftool: Implement link show support for netkit libbpf: Add link-based API for netkit tools: Sync if_link uapi header netkit, bpf: Add bpf programmable net device bpf: Improve JEQ/JNE branch taken logic bpf: Fold smp_mb__before_atomic() into atomic_set_release() bpf: Fix unnecessary -EBUSY from htab_lock_bucket xsk: Avoid starving the xsk further down the list bpf: print full verifier states on infinite loop detection selftests/bpf: test if state loops are detected in a tricky case bpf: correct loop detection for iterators convergence ... ==================== Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/mac80211/rx.c 91535613b609 ("wifi: mac80211: don't drop all unprotected public action frames") 6c02fab72429 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value") Adjacent changes: drivers/net/ethernet/apm/xgene/xgene_enet_main.c 61471264c018 ("net: ethernet: apm: Convert to platform remove callback returning void") d2ca43f30611 ("net: xgene: Fix unused xgene_enet_of_match warning for !CONFIG_OF") net/vmw_vsock/virtio_transport.c 64c99d2d6ada ("vsock/virtio: support to send non-linear skb") 53b08c498515 ("vsock/virtio: initialize the_virtio_vsock before using VQs") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26Merge branches 'pm-sleep', 'powercap' and 'pm-tools'Rafael J. Wysocki
Merge updates related to system sleep handling, one power capping update and one PM utility update for 6.7-rc1: - Use __get_safe_page() rather than touching the list in hibernation snapshot code (Brian Geffon). - Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav). - Clean up sync_read handling in snapshot_write_next() (Brian Geffon). - Fix kerneldoc comments for swsusp_check() and swsusp_close() to better match code (Christoph Hellwig). - Downgrade BIOS locked limits pr_warn() in the Intel RAPL power capping driver to pr_debug() (Ville Syrjälä). - Change the minimum python version for the intel_pstate_tracer utility from 2.7 to 3.6 (Doug Smythies). * pm-sleep: PM: hibernate: fix the kerneldoc comment for swsusp_check() and swsusp_close() PM: hibernate: Clean up sync_read handling in snapshot_write_next() PM: sleep: Fix symbol export for _SIMPLE_ variants of _PM_OPS() PM: hibernate: Use __get_safe_page() rather than touching the list * powercap: powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug() * pm-tools: tools/power/x86/intel_pstate_tracer: python minimum version
2023-10-26Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq updates for 6.7-rc1: - Add support for several Qualcomm SoC versions and other similar changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan Gerhold and Varadarajan Narayanan). - Clean up the tegra cpufreq driver (Sumit Gupta). - Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob Herring). - Add support for TI's am62p5 Soc (Bryan Brattlof). - Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian Fainelli). - Update Kconfig to mention i.MX7 as well (Alexander Stein). - Revise global turbo disable check in intel_pstate (Srinivas Pandruvada). - Carry out initialization of sg_cpu in the schedutil cpufreq governor in one loop (Liao Chang). - Simplify the condition for storing 'down_threshold' in the conservative cpufreq governor (Liao Chang). - Use fine-grained mutex in the userspace cpufreq governor (Liao Chang). - Move is_managed indicator in the userspace cpufreq governor into a per-policy structure (Liao Chang). - Rebuild sched-domains when removing cpufreq driver (Pierre Gondois). - Fix buffer overflow detection in trans_stats() (Christian Marangi). * pm-cpufreq: (32 commits) dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT cpufreq: qcom-nvmem: add support for IPQ8064 cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu cpufreq: qcom-nvmem: drop pvs_ver for format a fuses dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu cpufreq: qcom-nvmem: add support for IPQ6018 dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018 cpufreq: qcom-nvmem: Add MSM8909 cpufreq: qcom-nvmem: Simplify driver data allocation cpufreq: stats: Fix buffer overflow detection in trans_stats() dt-bindings: cpufreq: cpufreq-qcom-hw: Add SDX75 compatible cpufreq: ARM_BRCMSTB_AVS_CPUFREQ cannot be used with ARM_SCMI_CPUFREQ cpufreq: ti-cpufreq: Add opp support for am62p5 SoCs cpufreq: dt-platdev: add am62p5 to blocklist cpufreq: tegra194: remove redundant AND with cpu_online_mask cpufreq: tegra194: use refclk delta based loop instead of udelay cpufreq: tegra194: save CPU data to avoid repeated SMP calls cpufreq: Rebuild sched-domains when removing cpufreq driver cpufreq: userspace: Move is_managed indicator into per-policy structure ...
2023-10-26bpf: Add more WARN_ON_ONCE checks for mismatched alloc and freeHou Tao
There are two possible mismatched alloc and free cases in BPF memory allocator: 1) allocate from cache X but free by cache Y with a different unit_size 2) allocate from per-cpu cache but free by kmalloc cache or vice versa So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to spot these mismatched alloc and free early. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231021014959.3563841-1-houtao@huaweicloud.com
2023-10-26x86/apic/msi: Fix misconfigured non-maskable MSI quirkKoichiro Den
commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted"), reworked the code so that the x86 specific quirk for affinity setting of non-maskable PCI/MSI interrupts is not longer activated if necessary. This could be solved by restoring the original logic in the core MSI code, but after a deeper analysis it turned out that the quirk flag is not required at all. The quirk is only required when the PCI/MSI device cannot mask the MSI interrupts, which in turn also prevents reservation mode from being enabled for the affected interrupt. This allows ot remove the NOMASK quirk bit completely as msi_set_affinity() can instead check whether reservation mode is enabled for the interrupt, which gives exactly the same answer. Even in the momentary non-existing case that the reservation mode would be not set for a maskable MSI interrupt this would not cause any harm as it just would cause msi_set_affinity() to go needlessly through the functionaly equivalent slow path, which works perfectly fine with maskable interrupts as well. Rework msi_set_affinity() to query the reservation mode and remove all NOMASK quirk logic from the core code. [ tglx: Massaged changelog ] Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp
2023-10-25audit: don't take task_lock() in audit_exe_compare() code pathPaul Moore
The get_task_exe_file() function locks the given task with task_lock() which when used inside audit_exe_compare() can cause deadlocks on systems that generate audit records when the task_lock() is held. We resolve this problem with two changes: ignoring those cases where the task being audited is not the current task, and changing our approach to obtaining the executable file struct to not require task_lock(). With the intent of the audit exe filter being to filter on audit events generated by processes started by the specified executable, it makes sense that we would only want to use the exe filter on audit records associated with the currently executing process, e.g. @current. If we are asked to filter records using a non-@current task_struct we can safely ignore the exe filter without negatively impacting the admin's expectations for the exe filter. Knowing that we only have to worry about filtering the currently executing task in audit_exe_compare() we can do away with the task_lock() and call get_mm_exe_file() with @current->mm directly. Cc: <stable@vger.kernel.org> Fixes: 5efc244346f9 ("audit: fix exe_file access in audit_exe_compare") Reported-by: Andreas Steinmetz <anstein99@googlemail.com> Reviewed-by: John Johansen <john.johanse@canonical.com> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-10-25sched/fair: use folio_xchg_last_cpupid() in should_numa_migrate_memory()Kefeng Wang
Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory(). Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25sched/fair: use folio_xchg_access_time() in numa_hint_fault_latency()Kefeng Wang
Convert to use folio_xchg_access_time() in numa_hint_fault_latency(). Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25genirq/matrix: Exclude managed interrupts in irq_matrix_allocated()Chen Yu
When a CPU is about to be offlined, x86 validates that all active interrupts which are targeted to this CPU can be migrated to the remaining online CPUs. If not, the offline operation is aborted. The validation uses irq_matrix_allocated() to retrieve the number of vectors which are allocated on the outgoing CPU. The returned number of allocated vectors includes also vectors which are associated to managed interrupts. That's overaccounting because managed interrupts are: - not migrated when the affinity mask of the interrupt targets only the outgoing CPU - migrated to another CPU, but in that case the vector is already pre-allocated on the potential target CPUs and must not be taken into account. As a consequence the check whether the remaining online CPUs have enough capacity for migrating the allocated vectors from the outgoing CPU might fail incorrectly. Let irq_matrix_allocated() return only the number of allocated non-managed interrupts to make this validation check correct. [ tglx: Amend changelog and fixup kernel-doc comment ] Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator") Reported-by: Wendy Wang <wendy.wang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231020072522.557846-1-yu.c.chen@intel.com
2023-10-25swiotlb: do not try to allocate a TLB bigger than MAX_ORDER pagesPetr Tesarik
When allocating a new pool at runtime, reduce the number of slabs so that the allocation order is at most MAX_ORDER. This avoids a kernel warning in __alloc_pages(). The warning is relatively benign, because the pool size is subsequently reduced when allocation fails, but it is silly to start with a request that is known to fail, especially since this is the default behavior if the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any swiotlb= parameter. Reported-by: Ben Greear <greearb@candelatech.com> Closes: https://lore.kernel.org/netdev/4f173dd2-324a-0240-ff8d-abf5c191be18@candelatech.com/ Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full") Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-10-24netkit, bpf: Add bpf programmable net deviceDaniel Borkmann
This work adds a new, minimal BPF-programmable device called "netkit" (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The core idea is that BPF programs are executed within the drivers xmit routine and therefore e.g. in case of containers/Pods moving BPF processing closer to the source. One of the goals was that in case of Pod egress traffic, this allows to move BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms, for example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through per-CPU backlog queue. This helps to shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides). In this initial version, the netkit device ships as a pair, but we plan to extend this further so it can also operate in single device mode. The pair comes with a primary and a peer device. Only the primary device, typically residing in hostns, can manage BPF programs for itself and its peer. The peer device is designated for containers/Pods and cannot attach/detach BPF programs. Upon the device creation, the user can set the default policy to 'pass' or 'drop' for the case when no BPF program is attached. Additionally, the device can be operated in L3 (default) or L2 mode. The management of BPF programs is done via bpf_mprog, so that multi-attach is supported right from the beginning with similar API and dependency controls as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs"). tc BPF compatibility is provided, so that existing programs can be easily migrated. Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management and the peer will operate in default drop mode, so that no traffic is leaving between the time when a Pod is brought up by the CNI plugin and programs attached by the agent. Additionally, the programs we attach via tcx on the physical devices are using bpf_redirect_peer() for inbound traffic into netkit device, hence the latter is also supporting the ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device directly. Also, BIG TCP is supported on netkit device. For the follow-up work in single device mode, we plan to convert Cilium's cilium_host/_net devices into a single one. An extensive test suite for checking device operations and the BPF program and link management API comes as BPF selftests in this series. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://github.com/borkmann/iproute2/tree/pr/netkit Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.) Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24bpf: Improve JEQ/JNE branch taken logicAndrii Nakryiko
When determining if an if/else branch will always or never be taken, use signed range knowledge in addition to currently used unsigned range knowledge. If either signed or unsigned range suggests that condition is always/never taken, return corresponding branch_taken verdict. Current use of unsigned range for this seems arbitrary and unnecessarily incomplete. It is possible for *signed* operations to be performed on register, which could "invalidate" unsigned range for that register. In such case branch_taken will be artificially useless, even if we can still tell that some constant is outside of register value range based on its signed bounds. veristat-based validation shows zero differences across selftests, Cilium, and Meta-internal BPF object files. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
2023-10-24bpf: Fold smp_mb__before_atomic() into atomic_set_release()Paul E. McKenney
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set() immediately preceded by smp_mb__before_atomic() so as to order storing of ring-buffer consumer and producer positions prior to the atomic_set() call's clearing of the ->busy flag, as follows: smp_mb__before_atomic(); atomic_set(&rb->busy, 0); Although this works given current architectures and implementations, and given that this only needs to order prior writes against a later write. However, it does so by accident because the smp_mb__before_atomic() is only guaranteed to work with read-modify-write atomic operations, and not at all with things like atomic_set() and atomic_read(). Note especially that smp_mb__before_atomic() will not, repeat *not*, order the prior write to "a" before the subsequent non-read-modify-write atomic read from "b", even on strongly ordered systems such as x86: WRITE_ONCE(a, 1); smp_mb__before_atomic(); r1 = atomic_read(&b); Therefore, replace the smp_mb__before_atomic() and atomic_set() with atomic_set_release() as follows: atomic_set_release(&rb->busy, 0); This is no slower (and sometimes is faster) than the original, and also provides a formal guarantee of ordering that the original lacks. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
2023-10-24bpf: Fix unnecessary -EBUSY from htab_lock_bucketSong Liu
htab_lock_bucket uses the following logic to avoid recursion: 1. preempt_disable(); 2. check percpu counter htab->map_locked[hash] for recursion; 2.1. if map_lock[hash] is already taken, return -BUSY; 3. raw_spin_lock_irqsave(); However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ logic will not able to access the same hash of the hashtab and get -EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQ before checking map_locked: 1. preempt_disable(); 2. local_irq_save(); 3. check percpu counter htab->map_locked[hash] for recursion; 3.1. if map_lock[hash] is already taken, return -BUSY; 4. raw_spin_lock(). Similarly, use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket(). Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
2023-10-24perf/core: Fix potential NULL derefPeter Zijlstra
Smatch is awesome. Fixes: 32671e3799ca ("perf: Disallow mis-matched inherited group reads") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-24sched/fair: Remove SIS_PROPPeter Zijlstra
SIS_UTIL seems to work well, lets remove the old thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
2023-10-24sched/fair: Use candidate prev/recent_used CPU if scanning failed for ↵Yicong Yang
cluster wakeup Chen Yu reports a hackbench regression of cluster wakeup when hackbench threads equal to the CPU number [1]. Analysis shows it's because we wake up more on the target CPU even if the prev_cpu is a good wakeup candidate and leads to the decrease of the CPU utilization. Generally if the task's prev_cpu is idle we'll wake up the task on it without scanning. On cluster machines we'll try to wake up the task in the same cluster of the target for better cache affinity, so if the prev_cpu is idle but not sharing the same cluster with the target we'll still try to find an idle CPU within the cluster. This will improve the performance at low loads on cluster machines. But in the issue above, if the prev_cpu is idle but not in the cluster with the target CPU, we'll try to scan an idle one in the cluster. But since the system is busy, we're likely to fail the scanning and use target instead, even if the prev_cpu is idle. Then leads to the regression. This patch solves this in 2 steps: o record the prev_cpu/recent_used_cpu if they're good wakeup candidates but not sharing the cluster with the target. o on scanning failure use the prev_cpu/recent_used_cpu if they're recorded as idle [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/ Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/ Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
2023-10-24sched/fair: Scan cluster before scanning LLC in wake-up pathBarry Song
For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC to gain lower latency. This will be implemented in 2 steps in select_idle_sibling(): 1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them if they're sharing cluster with the target CPU. Otherwise trying to scan for an idle CPU in the target's cluster. 2. Scanning the cluster prior to the LLC of the target CPU for an idle CPU to wakeup. Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. With this patch, We noticed enhancement on tbench and netperf within one numa or cross two numa on top of tip-sched-core commit 9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()") tbench results (node 0): baseline patched 1: 327.2833 372.4623 ( 13.80%) 4: 1320.5933 1479.8833 ( 12.06%) 8: 2638.4867 2921.5267 ( 10.73%) 16: 5282.7133 5891.5633 ( 11.53%) 32: 9810.6733 9877.3400 ( 0.68%) 64: 7408.9367 7447.9900 ( 0.53%) 128: 6203.2600 6191.6500 ( -0.19%) tbench results (node 0-1): baseline patched 1: 332.0433 372.7223 ( 12.25%) 4: 1325.4667 1477.6733 ( 11.48%) 8: 2622.9433 2897.9967 ( 10.49%) 16: 5218.6100 5878.2967 ( 12.64%) 32: 10211.7000 11494.4000 ( 12.56%) 64: 13313.7333 16740.0333 ( 25.74%) 128: 13959.1000 14533.9000 ( 4.12%) netperf results TCP_RR (node 0): baseline patched 1: 76546.5033 90649.9867 ( 18.42%) 4: 77292.4450 90932.7175 ( 17.65%) 8: 77367.7254 90882.3467 ( 17.47%) 16: 78519.9048 90938.8344 ( 15.82%) 32: 72169.5035 72851.6730 ( 0.95%) 64: 25911.2457 25882.2315 ( -0.11%) 128: 10752.6572 10768.6038 ( 0.15%) netperf results TCP_RR (node 0-1): baseline patched 1: 76857.6667 90892.2767 ( 18.26%) 4: 78236.6475 90767.3017 ( 16.02%) 8: 77929.6096 90684.1633 ( 16.37%) 16: 77438.5873 90502.5787 ( 16.87%) 32: 74205.6635 88301.5612 ( 19.00%) 64: 69827.8535 71787.6706 ( 2.81%) 128: 25281.4366 25771.3023 ( 1.94%) netperf results UDP_RR (node 0): baseline patched 1: 96869.8400 110800.8467 ( 14.38%) 4: 97744.9750 109680.5425 ( 12.21%) 8: 98783.9863 110409.9637 ( 11.77%) 16: 99575.0235 110636.2435 ( 11.11%) 32: 95044.7250 97622.8887 ( 2.71%) 64: 32925.2146 32644.4991 ( -0.85%) 128: 12859.2343 12824.0051 ( -0.27%) netperf results UDP_RR (node 0-1): baseline patched 1: 97202.4733 110190.1200 ( 13.36%) 4: 95954.0558 106245.7258 ( 10.73%) 8: 96277.1958 105206.5304 ( 9.27%) 16: 97692.7810 107927.2125 ( 10.48%) 32: 79999.6702 103550.2999 ( 29.44%) 64: 80592.7413 87284.0856 ( 8.30%) 128: 27701.5770 29914.5820 ( 7.99%) Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch in the code has not been tested but it supposed to work. Chen Yu also noticed this will improve the performance of tbench and netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one cluster sharing L2 Cache. [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
2023-10-24sched: Add cpus_share_resources APIBarry Song
Add cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with cluster scheduler level. On a machine with clusters cpus_share_resources() will test whether two cpus are within the same cluster. On a non-cluster machine it will behaves the same as cpus_share_cache(). So we use "resources" here for cache resources. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
2023-10-24sched/core: Fix RQCF_ACT_SKIP leakHao Jia
Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning. This warning may be triggered in the following situations: CPU0 CPU1 __schedule() *rq->clock_update_flags <<= 1;* unregister_fair_sched_group() pick_next_task_fair+0x4a/0x410 destroy_cfs_bandwidth() newidle_balance+0x115/0x3e0 for_each_possible_cpu(i) *i=0* rq_unpin_lock(this_rq, rf) __cfsb_csd_unthrottle() raw_spin_rq_unlock(this_rq) rq_lock(*CPU0_rq*, &rf) rq_clock_start_loop_update() rq->clock_update_flags & RQCF_ACT_SKIP <-- raw_spin_rq_lock(this_rq) The purpose of RQCF_ACT_SKIP is to skip the update rq clock, but the update is very early in __schedule(), but we clear RQCF_*_SKIP very late, causing it to span that gap above and triggering this warning. In __schedule() we can clear the RQCF_*_SKIP flag immediately after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning. And set rq->clock_update_flags to RQCF_UPDATED to avoid rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later. Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com Reported-by: Igor Raits <igor.raits@gmail.com> Reported-by: Bagas Sanjaya <bagasdotme@gmail.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
2023-10-24printk: printk: Remove unnecessary statements'len = 0;'Li kunyu
In the following two functions, len has already been assigned a value of 0 when defining the variable, so remove 'len=0;'. Signed-off-by: Li kunyu <kunyu@nfschina.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20231023062359.130633-1-kunyu@nfschina.com
2023-10-23bpf: print full verifier states on infinite loop detectionEduard Zingerman
Additional logging in is_state_visited(): if infinite loop is detected print full verifier state for both current and equivalent states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: correct loop detection for iterators convergenceEduard Zingerman
It turns out that .branches > 0 in is_state_visited() is not a sufficient condition to identify if two verifier states form a loop when iterators convergence is computed. This commit adds logic to distinguish situations like below: (I) initial (II) initial | | V V .---------> hdr .. | | | | V V | .------... .------.. | | | | | | V V V V | ... ... .-> hdr .. | | | | | | | V V | V V | succ <- cur | succ <- cur | | | | | V | V | ... | ... | | | | '----' '----' For both (I) and (II) successor 'succ' of the current state 'cur' was previously explored and has branches count at 0. However, loop entry 'hdr' corresponding to 'succ' might be a part of current DFS path. If that is the case 'succ' and 'cur' are members of the same loop and have to be compared exactly. Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: exact states comparison for iterator convergence checksEduard Zingerman
Convergence for open coded iterators is computed in is_state_visited() by examining states with branches count > 1 and using states_equal(). states_equal() computes sub-state relation using read and precision marks. Read and precision marks are propagated from children states, thus are not guaranteed to be complete inside a loop when branches count > 1. This could be demonstrated using the following unsafe program: 1. r7 = -16 2. r6 = bpf_get_prandom_u32() 3. while (bpf_iter_num_next(&fp[-8])) { 4. if (r6 != 42) { 5. r7 = -32 6. r6 = bpf_get_prandom_u32() 7. continue 8. } 9. r0 = r10 10. r0 += r7 11. r8 = *(u64 *)(r0 + 0) 12. r6 = bpf_get_prandom_u32() 13. } Here verifier would first visit path 1-3, create a checkpoint at 3 with r7=-16, continue to 4-7,3 with r7=-32. Because instructions at 9-12 had not been visitied yet existing checkpoint at 3 does not have read or precision mark for r7. Thus states_equal() would return true and verifier would discard current state, thus unsafe memory access at 11 would not be caught. This commit fixes this loophole by introducing exact state comparisons for iterator convergence logic: - registers are compared using regs_exact() regardless of read or precision marks; - stack slots have to have identical type. Unfortunately, this is too strict even for simple programs like below: i = 0; while(iter_next(&it)) i++; At each iteration step i++ would produce a new distinct state and eventually instruction processing limit would be reached. To avoid such behavior speculatively forget (widen) range for imprecise scalar registers, if those registers were not precise at the end of the previous iteration and do not match exactly. This a conservative heuristic that allows to verify wide range of programs, however it precludes verification of programs that conjure an imprecise value on the first loop iteration and use it as precise on the second. Test case iter_task_vma_for_each() presents one of such cases: unsigned int seen = 0; ... bpf_for_each(task_vma, vma, task, 0) { if (seen >= 1000) break; ... seen++; } Here clang generates the following code: <LBB0_4>: 24: r8 = r6 ; stash current value of ... body ... 'seen' 29: r1 = r10 30: r1 += -0x8 31: call bpf_iter_task_vma_next 32: r6 += 0x1 ; seen++; 33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL 34: r7 += 0x10 35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000 <LBB0_6>: ... exit ... Note that counter in r6 is copied to r8 and then incremented, conditional jump is done using r8. Because of this precision mark for r6 lags one state behind of precision mark on r8 and widening logic kicks in. Adding barrier_var(seen) after conditional is sufficient to force clang use the same register for both counting and conditional jump. This issue was discussed in the thread [1] which was started by Andrew Werner <awerner32@gmail.com> demonstrating a similar bug in callback functions handling. The callbacks would be addressed in a followup patch. [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/ Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: extract same_callsites() as utility functionEduard Zingerman
Extract same_callsites() from clean_live_states() as a utility function. This function would be used by the next patch in the set. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: move explored_state() closer to the beginning of verifier.cEduard Zingerman
Subsequent patches would make use of explored_state() function. Move it up to avoid adding unnecessary prototype. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-24Merge tag 'topic/vmemdup-user-array-2023-10-24-1' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm into drm-next vmemdup-user-array API and changes with it. This is just a process PR to merge the topic branch into drm-next, this contains some core kernel and drm changes. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231024010905.646830-1-airlied@redhat.com
2023-10-24kprobes: unused header files removedwuqiang.matt
As kernel test robot reported, lib/test_objpool.c (trace:probes/for-next) has linux/version.h included, but version.h is not used at all. Then more unused headers are found in test_objpool.c and rethook.c, and all of them should be removed. Link: https://lore.kernel.org/all/20231023112245.6112-1-wuqiang.matt@bytedance.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310191512.vvypKU5Z-lkp@intel.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-23bpf, tcx: Get rid of tcx_link_constDaniel Borkmann
Small clean up to get rid of the extra tcx_link_const() and only retain the tcx_link(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231023185015.21152-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-23tracing/histograms: Simplify last_cmd_set()Christophe JAILLET
Turn a kzalloc()+strcpy()+strncat() into an equivalent and less verbose kasprintf(). Link: https://lore.kernel.org/linux-trace-kernel/30b6fb04dadc10a03cc1ad08f5d8a93ef623a167.1697899346.git.christophe.jaillet@wanadoo.fr Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Mukesh ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-10-23Merge branches 'rcu/torture', 'rcu/fixes', 'rcu/docs', 'rcu/refscale', ↵Frederic Weisbecker
'rcu/tasks' and 'rcu/stall' into rcu/next rcu/torture: RCU torture, locktorture and generic torture infrastructure rcu/fixes: Generic and misc fixes rcu/docs: RCU documentation updates rcu/refscale: RCU reference scalability test updates rcu/tasks: RCU tasks updates rcu/stall: Stall detection updates
2023-10-23Merge tag 'v6.6-rc7' into sched/core, to pick up fixesIngo Molnar
Pick up recent sched/urgent fixes merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-23dma-debug: Fix a typo in a debugging eye-catcherChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-10-23swiotlb: rewrite comment explaining why the source is preserved on ↵Sean Christopherson
DMA_FROM_DEVICE Rewrite the comment explaining why swiotlb copies the original buffer to the TLB buffer before initiating DMA *from* the device, i.e. before the device DMAs into the TLB buffer. The existing comment's argument that preserving the original data can prevent a kernel memory leak is bogus. If the driver that triggered the mapping _knows_ that the device will overwrite the entire mapping, or the driver will consume only the written parts, then copying from the original memory is completely pointless. If neither of the above holds true, then copying from the original adds value only if preserving the data is necessary for functional correctness, or the driver explicitly initialized the original memory. If the driver didn't initialize the memory, then copying the original buffer to the TLB buffer simply changes what kernel data is leaked to user space. Writing the entire TLB buffer _does_ prevent leaking stale TLB buffer data from a previous bounce, but that can be achieved by simply zeroing the TLB buffer when grabbing a slot. The real reason swiotlb ended up initializing the TLB buffer with the original buffer is that it's necessary to make swiotlb operate as transparently as possible, i.e. to behave as closely as possible to hardware, and to avoid corrupting the original buffer, e.g. if the driver knows the device will do partial writes and is relying on the unwritten data to be preserved. Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/all/ZN5elYQ5szQndN8n@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-10-22dma-direct: warn when coherent allocations aren't supportedChristoph Hellwig
Log a warning once when dma_alloc_coherent fails because the platform does not support coherent allocations at all. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-22dma-direct: simplify the use atomic pool logic in dma_direct_allocChristoph Hellwig
The logic in dma_direct_alloc when to use the atomic pool vs remapping grew a bit unreadable. Consolidate it into a single check, and clean up the set_uncached vs remap logic a bit as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-22dma-direct: add a CONFIG_ARCH_HAS_DMA_ALLOC symbolChristoph Hellwig
Instead of using arch_dma_alloc if none of the generic coherent allocators are used, require the architectures to explicitly opt into providing it. This will used to deal with the case of m68knommu and coldfire where we can't do any coherent allocations whatsoever, and also makes it clear that arch_dma_alloc is a last resort. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-22dma-direct: add dependencies to CONFIG_DMA_GLOBAL_POOLChristoph Hellwig
CONFIG_DMA_GLOBAL_POOL can't be combined with other DMA coherent allocators. Add dependencies to Kconfig to document this, and make kconfig complain about unmet dependencies if someone tries. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-21Merge tag 'sched-urgent-2023-10-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a recently introduced use-after-free bug" * tag 'sched-urgent-2023-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/eevdf: Fix heap corruption more
2023-10-21Merge tag 'perf-urgent-2023-10-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fix from Ingo Molnar: "Fix group event semantics" * tag 'perf-urgent-2023-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Disallow mis-matched inherited group reads
2023-10-21Merge tag 'probes-fixes-v6.6-rc6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - kprobe-events: Fix kprobe events to reject if the attached symbol is not unique name because it may not the function which the user want to attach to. (User can attach a probe to such symbol using the nearest unique symbol + offset.) - selftest: Add a testcase to ensure the kprobe event rejects non unique symbol correctly. * tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: selftests/ftrace: Add new test case which checks non unique symbol tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols
2023-10-20bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()Hou Tao
The following warning was reported when running "./test_progs -t test_bpf_ma/percpu_free_through_map_free": ------------[ cut here ]------------ WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342 CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:bpf_mem_refill+0x21c/0x2a0 ...... Call Trace: <IRQ> ? bpf_mem_refill+0x21c/0x2a0 irq_work_single+0x27/0x70 irq_work_run_list+0x2a/0x40 irq_work_run+0x18/0x40 __sysvec_irq_work+0x1c/0xc0 sysvec_irq_work+0x73/0x90 </IRQ> <TASK> asm_sysvec_irq_work+0x1b/0x20 RIP: 0010:unit_free+0x50/0x80 ...... bpf_mem_free+0x46/0x60 __bpf_obj_drop_impl+0x40/0x90 bpf_obj_free_fields+0x17d/0x1a0 array_map_free+0x6b/0x170 bpf_map_free_deferred+0x54/0xa0 process_scheduled_works+0xba/0x370 worker_thread+0x16d/0x2e0 kthread+0x105/0x140 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1b/0x30 </TASK> ---[ end trace 0000000000000000 ]--- The reason is simple: __bpf_obj_drop_impl() does not know the freeing field is a per-cpu pointer and it uses bpf_global_ma to free the pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is used to select the corresponding cache. The bpf_mem_cache with 16-bytes unit_size will always be selected to do the unmatched free and it will trigger the warning in free_bulk() eventually. Because per-cpu kptr doesn't support list or rb-tree now, so fix the problem by only checking whether or not the type of kptr is per-cpu in bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.hHou Tao
both syscall.c and helpers.c have the declaration of __bpf_obj_drop_impl(), so just move it to a common header file. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()Hou Tao
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the unbalance by checking whether the bpf memory allocator is per-cpu or not and use pcpu_alloc_size() instead of ksize() to find the correct cache for per-cpu free. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Re-enable unit_size checking for global per-cpu allocatorHou Tao
With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20tracing: Move readpos from seq_buf to trace_seqMatthew Wilcox (Oracle)
To make seq_buf more lightweight as a string buf, move the readpos member from seq_buf to its container, trace_seq. That puts the responsibility of maintaining the readpos entirely in the tracing code. If some future users want to package up the readpos with a seq_buf, we can define a new struct then. Link: https://lore.kernel.org/linux-trace-kernel/20231020033545.2587554-2-willy@infradead.org Cc: Kees Cook <keescook@chromium.org> Cc: Justin Stitt <justinstitt@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>