summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-10-25bpf: Refactor the core bpf_task_storage_get logic into a new functionMartin KaFai Lau
This patch creates a new function __bpf_task_storage_get() and moves the core logic of the existing bpf_task_storage_get() into this new function. This new function will be shared by another new helper proto in the latter patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20221025184524.3526117-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25bpf: Append _recur naming to the bpf_task_storage helper protoMartin KaFai Lau
This patch adds the "_recur" naming to the bpf_task_storage_{get,delete} proto. In a latter patch, they will only be used by the tracing programs that requires a deadlock detection because a tracing prog may use bpf_task_storage_{get,delete} recursively and cause a deadlock. Another following patch will add a different helper proto for the non tracing programs because they do not need the deadlock prevention. This patch does this rename to prepare for this future proto additions. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20221025184524.3526117-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25bpf: Remove prog->active check for bpf_lsm and bpf_iterMartin KaFai Lau
The commit 64696c40d03c ("bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline") removed prog->active check for struct_ops prog. The bpf_lsm and bpf_iter is also using trampoline. Like struct_ops, the bpf_lsm and bpf_iter have fixed hooks for the prog to attach. The kernel does not call the same hook in a recursive way. This patch also removes the prog->active check for bpf_lsm and bpf_iter. A later patch has a test to reproduce the recursion issue for a sleepable bpf_lsm program. This patch appends the '_recur' naming to the existing enter and exit functions that track the prog->active counter. New __bpf_prog_{enter,exit}[_sleepable] function are added to skip the prog->active tracking. The '_struct_ops' version is also removed. It also moves the decision on picking the enter and exit function to the new bpf_trampoline_{enter,exit}(). It returns the '_recur' ones for all tracing progs to use. For bpf_lsm, bpf_iter, struct_ops (no prog->active tracking after 64696c40d03c), and bpf_lsm_cgroup (no prog->active tracking after 69fd337a975c7), it will return the functions that don't track the prog->active. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20221025184524.3526117-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25fs/exec: switch timens when a task gets a new mmAndrei Vagin
Changing a time namespace requires remapping a vvar page, so we don't want to allow doing that if any other tasks can use the same mm. Currently, we install a time namespace when a task is created with a new vm. exec() is another case when a task gets a new mm and so it can switch a time namespace safely, but it isn't handled now. One more issue of the current interface is that clone() with CLONE_VM isn't allowed if the current task has unshared a time namespace (timens_for_children doesn't match the current timens). Both these issues make some inconvenience for users. For example, Alexey and Florian reported that posix_spawn() uses vfork+exec and this pattern doesn't work with time namespaces due to the both described issues. LXC needed to workaround the exec() issue by calling setns. In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But it adds extra complexity and some undesirable side effects. Eric suggested fixing the issues properly because here are all the reasons to suppose that there are no users that depend on the old behavior. Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Christian Brauner <brauner@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Origin-author: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220921003120.209637-1-avagin@google.com
2022-10-25bpf: Take module reference on kprobe_multi linkJiri Olsa
Currently we allow to create kprobe multi link on function from kernel module, but we don't take the module reference to ensure it's not unloaded while we are tracing it. The multi kprobe link is based on fprobe/ftrace layer which takes different approach and releases ftrace hooks when module is unloaded even if there's tracer registered on top of it. Adding code that gathers all the related modules for the link and takes their references before it's attached. All kernel module references are released after link is unregistered. Note that we do it the same way already for trampoline probes (but for single address). Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25bpf: Rename __bpf_kprobe_multi_cookie_cmp to bpf_kprobe_multi_addrs_cmpJiri Olsa
Renaming __bpf_kprobe_multi_cookie_cmp to bpf_kprobe_multi_addrs_cmp, because it's more suitable to current and upcoming code. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25ftrace: Add support to resolve module symbols in ftrace_lookup_symbolsJiri Olsa
Currently ftrace_lookup_symbols iterates only over core symbols, adding module_kallsyms_on_each_symbol call to check on modules symbols as well. Also removing 'args.found == args.cnt' condition, because it's already checked in kallsyms_callback function. Also removing 'err < 0' check, because both *kallsyms_on_each_symbol functions do not return error. Reported-by: Martynas Pumputis <m@lambda.lt> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25kallsyms: Make module_kallsyms_on_each_symbol generally availableJiri Olsa
Making module_kallsyms_on_each_symbol generally available, so it can be used outside CONFIG_LIVEPATCH option in following changes. Rather than adding another ifdef option let's make the function generally available (when CONFIG_KALLSYMS and CONFIG_MODULES options are defined). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25PM: hibernate: Allow hybrid sleep to work with s2idleMario Limonciello
Hybrid sleep is currently hardcoded to only operate with S3 even on systems that might not support it. Instead of assuming this mode is what the user wants to use, for hybrid sleep follow the setting of `mem_sleep_current` which will respect mem_sleep_default kernel command line and policy decisions made by the presence of the FADT low power idle bit. Fixes: 81d45bdf8913 ("PM / hibernate: Untangle power_down()") Reported-and-tested-by: kolAflash <kolAflash@kolahilft.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216574 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
include/linux/net.h a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag") e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24Merge tag 'net-6.1-rc3-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf. The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe I'm biased. Current release - regressions: - eth: fman: re-expose location of the MAC address to userspace, apparently some udev scripts depended on the exact value Current release - new code bugs: - bpf: - wait for busy refill_work when destroying bpf memory allocator - allow bpf_user_ringbuf_drain() callbacks to return 1 - fix dispatcher patchable function entry to 5 bytes nop Previous releases - regressions: - net-memcg: avoid stalls when under memory pressure - tcp: fix indefinite deferral of RTO with SACK reneging - tipc: fix a null-ptr-deref in tipc_topsrv_accept - eth: macb: specify PHY PM management done by MAC - tcp: fix a signed-integer-overflow bug in tcp_add_backlog() Previous releases - always broken: - eth: amd-xgbe: SFP fixes and compatibility improvements Misc: - docs: netdev: offer performance feedback to contributors" * tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits) net-memcg: avoid stalls when under memory pressure tcp: fix indefinite deferral of RTO with SACK reneging tcp: fix a signed-integer-overflow bug in tcp_add_backlog() net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed docs: netdev: offer performance feedback to contributors kcm: annotate data-races around kcm->rx_wait kcm: annotate data-races around kcm->rx_psock net: fman: Use physical address for userspace interfaces net/mlx5e: Cleanup MACsec uninitialization routine atlantic: fix deadlock at aq_nic_stop nfp: only clean `sp_indiff` when application firmware is unloaded amd-xgbe: add the bit rate quirk for Molex cables amd-xgbe: fix the SFP compliance codes check for DAC cables amd-xgbe: enable PLL_CTL for fixed PHY modes only amd-xgbe: use enums for mailbox cmd and sub_cmds amd-xgbe: Yellow carp devices do not need rrc bpf: Use __llist_del_all() whenever possbile during memory draining bpf: Wait for busy refill_work when destroying bpf memory allocator MAINTAINERS: add keyword match on PTP ...
2022-10-24Merge tag 'rcu-urgent.2022.10.20a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "Fix a regression caused by commit bf95b2bc3e42 ("rcu: Switch polled grace-period APIs to ->gp_seq_polled"), which could incorrectly leave interrupts enabled after an early-boot call to synchronize_rcu(). Such synchronize_rcu() calls must acquire leaf rcu_node locks in order to properly interact with polled grace periods, but the code did not take into account the possibility of synchronize_rcu() being invoked from the portion of the boot sequence during which interrupts are disabled. This commit therefore switches the lock acquisition and release from irq to irqsave/irqrestore" * tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu: Keep synchronize_rcu() from enabling irqs in early boot
2022-10-24Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2022-10-23 We've added 7 non-merge commits during the last 18 day(s) which contain a total of 8 files changed, 69 insertions(+), 5 deletions(-). The main changes are: 1) Wait for busy refill_work when destroying bpf memory allocator, from Hou. 2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David. 3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri. 4) Prevent decl_tag from being referenced in func_proto, from Stanislav. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Use __llist_del_all() whenever possbile during memory draining bpf: Wait for busy refill_work when destroying bpf memory allocator bpf: Fix dispatcher patchable function entry to 5 bytes nop bpf: prevent decl_tag from being referenced in func_proto selftests/bpf: Add reproducer for decl_tag in func_proto return type selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1 bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1 ==================== Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-23kill signal_pt_regs()Al Viro
Once upon at it was used on hot paths, but that had not been true since 2013. IOW, there's no point for arch-optimized equivalent of task_pt_regs(current) - remaining two users are not worth bothering with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-23kernel/utsname_sysctl.c: Fix hostname pollingLinus Torvalds
Commit bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch") added a new entry to the uts_kern_table[] array, but didn't update the UTS_PROC_xyz enumerators of older entries, breaking anything that used them. Which is admittedly not many cases: it's really just the two uses of uts_proc_notify() in kernel/sys.c. But apparently journald-systemd actually uses this to detect hostname changes. Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Fixes: bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch") Link: https://lore.kernel.org/lkml/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/ Link: https://linux-regtracking.leemhuis.info/regzbot/regression/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/ Cc: Petr Vorel <pvorel@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-10-23Merge tag 'perf_urgent_for_v6.1_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Fix raw data handling when perf events are used in bpf - Rework how SIGTRAPs get delivered to events to address a bunch of problems with it. Add a selftest for that too * tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: bpf: Fix sample_flags for bpf_perf_event_output selftests/perf_events: Add a SIGTRAP stress test with disables perf: Fix missing SIGTRAPs
2022-10-23Merge tag 'sched_urgent_for_v6.1_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Adjust code to not trip up CFI - Fix sched group cookie matching * tag 'sched_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Introduce struct balance_callback to avoid CFI mismatches sched/core: Fix comparison in sched_group_cookie_match()
2022-10-21bpf: Consider all mem_types compatible for map_{key,value} argsDave Marchevsky
After the previous patch, which added PTR_TO_MEM | MEM_ALLOC type map_key_value_types, the only difference between map_key_value_types and mem_types sets is PTR_TO_BUF and PTR_TO_MEM, which are in the latter set but not the former. Helpers which expect ARG_PTR_TO_MAP_KEY or ARG_PTR_TO_MAP_VALUE already effectively expect a valid blob of arbitrary memory that isn't necessarily explicitly associated with a map. When validating a PTR_TO_MAP_{KEY,VALUE} arg, the verifier expects meta->map_ptr to have already been set, either by an earlier ARG_CONST_MAP_PTR arg, or custom logic like that in process_timer_func or process_kptr_func. So let's get rid of map_key_value_types and just use mem_types for those args. This has the effect of adding PTR_TO_BUF and PTR_TO_MEM to the set of compatible types for ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE. PTR_TO_BUF is used by various bpf_iter implementations to represent a chunk of valid r/w memory in ctx args for iter prog. PTR_TO_MEM is used by networking, tracing, and ringbuf helpers to represent a chunk of valid memory. The PTR_TO_MEM | MEM_ALLOC type added in previous commit is specific to ringbuf helpers. Presence or absence of MEM_ALLOC doesn't change the validity of using PTR_TO_MEM as a map_{key,val} input. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221020160721.4030492-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21bpf: Allow ringbuf memory to be used as map keyDave Marchevsky
This patch adds support for the following pattern: struct some_data *data = bpf_ringbuf_reserve(&ringbuf, sizeof(struct some_data, 0)); if (!data) return; bpf_map_lookup_elem(&another_map, &data->some_field); bpf_ringbuf_submit(data); Currently the verifier does not consider bpf_ringbuf_reserve's PTR_TO_MEM | MEM_ALLOC ret type a valid key input to bpf_map_lookup_elem. Since PTR_TO_MEM is by definition a valid region of memory, it is safe to use it as a key for lookups. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221020160721.4030492-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21bpf: Use __llist_del_all() whenever possbile during memory drainingHou Tao
Except for waiting_for_gp list, there are no concurrent operations on free_by_rcu, free_llist and free_llist_extra lists, so use __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be deleted by RCU callback concurrently, so still use llist_del_all(). Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21bpf: Wait for busy refill_work when destroying bpf memory allocatorHou Tao
A busy irq work is an unfinished irq work and it can be either in the pending state or in the running state. When destroying bpf memory allocator, refill_work may be busy for PREEMPT_RT kernel in which irq work is invoked in a per-CPU RT-kthread. It is also possible for kernel with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or mips) and irq work is inovked in timer interrupt. The busy refill_work leads to various issues. The obvious one is that there will be concurrent operations on free_by_rcu and free_list between irq work and memory draining. Another one is call_rcu_in_progress will not be reliable for the checking of pending RCU callback because do_call_rcu() may have not been invoked by irq work yet. The other is there will be use-after-free if irq work is freed before the callback of irq work is invoked as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0 Oops: 0010 [#1] PREEMPT_RT SMP CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 0018:ffffadc080293e78 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388 ...... Call Trace: <TASK> irq_work_single+0x24/0x60 irq_work_run_list+0x24/0x30 run_irq_workd+0x23/0x30 smpboot_thread_fn+0x203/0x300 kthread+0x126/0x150 ret_from_fork+0x1f/0x30 </TASK> Considering the ease of concurrency handling, no overhead for irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt kernel and the short wait time used for irq_work_sync() under PREEMPT_RT (When running two test_maps on PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and the 99th percentile is 10us), just using irq_work_sync() to wait for busy refill_work to complete before memory draining and memory freeing. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - fix nvme-hwmon for DMA non-cohehrent architectures (Serge Semin) - add a nvme-hwmong maintainer (Christoph Hellwig) - fix error pointer dereference in error handling (Dan Carpenter) - fix invalid memory reference in nvmet_subsys_attr_qid_max_show (Daniel Wagner) - don't limit the DMA segment size in nvme-apple (Russell King) - fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg) - disable write zeroes on various Kingston SSDs (Xander Li) - fix a memory leak with block device tracing (Ye) - flexible-array fix for ublk (Yushan) - document the ublk recovery feature from this merge window (ZiyangZhang) - remove dead bfq variable in struct (Yuwei) - error handling rq clearing fix (Yu) - add an IRQ safety check for the cached bio freeing (Pavel) - drbd bio cloning fix (Christoph) * tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux: blktrace: remove unnessary stop block trace in 'blk_trace_shutdown' blktrace: fix possible memleak in '__blk_trace_remove' blktrace: introduce 'blk_trace_{start,stop}' helper bio: safeguard REQ_ALLOC_CACHE bio put block, bfq: remove unused variable for bfq_queue drbd: only clone bio if we have a backing device ublk_drv: use flexible-array member instead of zero-length array nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show nvmet: fix workqueue MEM_RECLAIM flushing dependency nvme-hwmon: kmalloc the NVME SMART log buffer nvme-hwmon: consistently ignore errors from nvme_hwmon_init nvme: add Guenther as nvme-hwmon maintainer nvme-apple: don't limit DMA segement size nvme-pci: disable write zeroes on various Kingston SSD nvme: fix error pointer dereference in error handling Documentation: document ublk user recovery feature blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
2022-10-21Merge tag 'mm-hotfixes-stable-2022-10-20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morron: "Seventeen hotfixes, mainly for MM. Five are cc:stable and the remainder address post-6.0 issues" * tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nouveau: fix migrate_to_ram() for faulting page mm/huge_memory: do not clobber swp_entry_t during THP split hugetlb: fix memory leak associated with vma_lock structure mm/page_alloc: reduce potential fragmentation in make_alloc_exact() mm: /proc/pid/smaps_rollup: fix maple tree search mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages mm/mmap: fix MAP_FIXED address return on VMA merge mm/mmap.c: __vma_adjust(): suppress uninitialized var warning mm/mmap: undo ->mmap() when mas_preallocate() fails init: Kconfig: fix spelling mistake "satify" -> "satisfy" ocfs2: clear dinode links count in case of error ocfs2: fix BUG when iput after ocfs2_mknod fails gcov: support GCC 12.1 and newer compilers zsmalloc: zs_destroy_pool: add size_class NULL check mm/mempolicy: fix mbind_range() arguments to vma_merge() mailmap: update email for Qais Yousef mailmap: update Dan Carpenter's email address
2022-10-21srcu: Debug NMI safety even on archs that don't require itFrederic Weisbecker
Currently the NMI safety debugging is only performed on architectures that don't support NMI-safe this_cpu_inc(). Reorder the code so that other architectures like x86 also detect bad uses. [ paulmck: Apply kernel test robot, Stephen Rothwell, and Zqiang feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21srcu: Explain the reason behind the read side critical section on GP startFrederic Weisbecker
Tell about the need to protect against concurrent updaters who may overflow the GP counter behind the current update. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21srcu: Warn when NMI-unsafe API is used in NMIFrederic Weisbecker
Using the NMI-unsafe reader API from within an NMI handler is very likely to be buggy for three reasons: 1) NMIs aren't strictly re-entrant (a pending nested NMI will execute at the end of the current one) so it should be fine to use a non-atomic increment here. However, breakpoints can still interrupt NMIs and if a breakpoint callback has a reader on that same ssp, a racy increment can happen. 2) If the only reader site for a given srcu_struct structure is in an NMI handler, then RCU should be used instead of SRCU. 3) Because of the previous reason (2), an srcu_struct structure having an SRCU read side critical section in an NMI handler is likely to have another one from a task context. For all these reasons, warn if an NMI-unsafe reader API is used from an NMI handler. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()Zqiang
Running rcutorture with non-zero fqs_duration module parameter in a kernel built with CONFIG_PREEMPTION=y results in the following splat: BUG: using __this_cpu_read() in preemptible [00000000] code: rcu_torture_fqs/398 caller is __this_cpu_preempt_check+0x13/0x20 CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+ Call Trace: <TASK> dump_stack_lvl+0x5b/0x86 dump_stack+0x10/0x16 check_preemption_disabled+0xe5/0xf0 __this_cpu_preempt_check+0x13/0x20 rcu_force_quiescent_state.part.0+0x1c/0x170 rcu_force_quiescent_state+0x1e/0x30 rcu_torture_fqs+0xca/0x160 ? rcu_torture_boost+0x430/0x430 kthread+0x192/0x1d0 ? kthread_complete_and_exit+0x30/0x30 ret_from_fork+0x22/0x30 </TASK> The problem is that rcu_force_quiescent_state() uses __this_cpu_read() in preemptible code instead of the proper raw_cpu_read(). This commit therefore changes __this_cpu_read() to raw_cpu_read(). Signed-off-by: Zqiang <qiang1.zhang@intel.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21rcu-tasks: Make grace-period-age message human-readablePaul E. McKenney
This commit adds a few words to the informative message that appears every ten seconds in RCU Tasks and RCU Tasks Trace grace periods. This message currently reads as follows: rcu_tasks_wait_gp: rcu_tasks grace period 1046 is 10088 jiffies old. After this change, it provides additional context, instead reading as follows: rcu_tasks_wait_gp: rcu_tasks grace period number 1046 (since boot) is 10088 jiffies old. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21rcu: Remove rcu_is_idle_cpu()Yipeng Zou
The commit 3fcd6a230fa7 ("x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs") introduced rcu_is_idle_cpu() in order to identify the current CPU idle state. But commit f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()") switched to using MAX_SAMPLE_AGE, so rcu_is_idle_cpu() is no longer used. This commit therefore removes it. Fixes: f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-20gcov: support GCC 12.1 and newer compilersMartin Liska
Starting with GCC 12.1, the created .gcda format can't be read by gcov tool. There are 2 significant changes to the .gcda file format that need to be supported: a) [gcov: Use system IO buffering] (23eb66d1d46a34cb28c4acbdf8a1deb80a7c5a05) changed that all sizes in the format are in bytes and not in words (4B) b) [gcov: make profile merging smarter] (72e0c742bd01f8e7e6dcca64042b9ad7e75979de) add a new checksum to the file header. Tested with GCC 7.5, 10.4, 12.2 and the current master. Link: https://lkml.kernel.org/r/624bda92-f307-30e9-9aaa-8cc678b2dfb2@suse.cz Signed-off-by: Martin Liska <mliska@suse.cz> Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20bpf: Fix dispatcher patchable function entry to 5 bytes nopJiri Olsa
The patchable_function_entry(5) might output 5 single nop instructions (depends on toolchain), which will clash with bpf_arch_text_poke check for 5 bytes nop instruction. Adding early init call for dispatcher that checks and change the patchable entry into expected 5 nop instruction if needed. There's no need to take text_mutex, because we are using it in early init call which is called at pre-smp time. Fixes: ceea991a019c ("bpf: Move bpf_dispatcher function out of ftrace locations") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221018075934.574415-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20rcu: Keep synchronize_rcu() from enabling irqs in early bootPaul E. McKenney
Making polled RCU grace periods account for expedited grace periods required acquiring the leaf rcu_node structure's lock during early boot, but after rcu_init() was called. This lock is irq-disabled, but the code incorrectly assumes that irqs are always disabled when invoking synchronize_rcu(). The exception is early boot before the scheduler has started, which means that upon return from synchronize_rcu(), irqs will be incorrectly enabled. This commit fixes this bug by using irqsave/irqrestore locking primitives. Fixes: bf95b2bc3e42 ("rcu: Switch polled grace-period APIs to ->gp_seq_polled") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-20srcu: Check for consistent global per-srcu_struct NMI safetyPaul E. McKenney
This commit adds runtime checks to verify that a given srcu_struct uses consistent NMI-safe (or not) read-side primitives globally, but based on the per-CPU data. These global checks are made by the grace-period code that must scan the srcu_data structures anyway, and are done only in kernels built with CONFIG_PROVE_RCU=y. Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
2022-10-20srcu: Check for consistent per-CPU per-srcu_struct NMI safetyPaul E. McKenney
This commit adds runtime checks to verify that a given srcu_struct uses consistent NMI-safe (or not) read-side primitives on a per-CPU basis. Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
2022-10-20srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()Paul E. McKenney
On strict load-store architectures, the use of this_cpu_inc() by srcu_read_lock() and srcu_read_unlock() is not NMI-safe in TREE SRCU. To see this suppose that an NMI arrives in the middle of srcu_read_lock(), just after it has read ->srcu_lock_count, but before it has written the incremented value back to memory. If that NMI handler also does srcu_read_lock() and srcu_read_lock() on that same srcu_struct structure, then upon return from that NMI handler, the interrupted srcu_read_lock() will overwrite the NMI handler's update to ->srcu_lock_count, but leave unchanged the NMI handler's update by srcu_read_unlock() to ->srcu_unlock_count. This can result in a too-short SRCU grace period, which can in turn result in arbitrary memory corruption. If the NMI handler instead interrupts the srcu_read_unlock(), this can result in eternal SRCU grace periods, which is not much better. This commit therefore creates a pair of new srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() functions, which allow SRCU readers in both NMI handlers and in process and IRQ context. It is bad practice to mix the existing and the new _nmisafe() primitives on the same srcu_struct structure. Use one set or the other, not both. Just to underline that "bad practice" point, using srcu_read_lock() at process level and srcu_read_lock_nmisafe() in your NMI handler will not, repeat NOT, work. If you do not immediately understand why this is the case, please review the earlier paragraphs in this commit log. [ paulmck: Apply kernel test robot feedback. ] [ paulmck: Apply feedback from Randy Dunlap. ] [ paulmck: Apply feedback from John Ogness. ] [ paulmck: Apply feedback from Frederic Weisbecker. ] Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
2022-10-20blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'Ye Bin
As previous commit, 'blk_trace_cleanup' will stop block trace if block trace's state is 'Blktrace_running'. So remove unnessary stop block trace in 'blk_trace_shutdown'. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019033602.752383-4-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20blktrace: fix possible memleak in '__blk_trace_remove'Ye Bin
When test as follows: step1: ioctl(sda, BLKTRACESETUP, &arg) step2: ioctl(sda, BLKTRACESTART, NULL) step3: ioctl(sda, BLKTRACETEARDOWN, NULL) step4: ioctl(sda, BLKTRACESETUP, &arg) Got issue as follows: debugfs: File 'dropped' in directory 'sda' already present! debugfs: File 'msg' in directory 'sda' already present! debugfs: File 'trace0' in directory 'sda' already present! And also find syzkaller report issue like "KASAN: use-after-free Read in relay_switch_subbuf" "https://syzkaller.appspot.com/bug?id=13849f0d9b1b818b087341691be6cc3ac6a6bfb7" If remove block trace without stop(BLKTRACESTOP) block trace, '__blk_trace_remove' will just set 'q->blk_trace' with NULL. However, debugfs file isn't removed, so will report file already present when call BLKTRACESETUP. static int __blk_trace_remove(struct request_queue *q) { struct blk_trace *bt; bt = rcu_replace_pointer(q->blk_trace, NULL, lockdep_is_held(&q->debugfs_mutex)); if (!bt) return -EINVAL; if (bt->trace_state != Blktrace_running) blk_trace_cleanup(q, bt); return 0; } If do test as follows: step1: ioctl(sda, BLKTRACESETUP, &arg) step2: ioctl(sda, BLKTRACESTART, NULL) step3: ioctl(sda, BLKTRACETEARDOWN, NULL) step4: remove sda There will remove debugfs directory which will remove recursively all file under directory. >> blk_release_queue >> debugfs_remove_recursive(q->debugfs_dir) So all files which created in 'do_blk_trace_setup' are removed, and 'dentry->d_inode' is NULL. But 'q->blk_trace' is still in 'running_trace_lock', 'trace_note_tsk' will traverse 'running_trace_lock' all nodes. >>trace_note_tsk >> trace_note >> relay_reserve >> relay_switch_subbuf >> d_inode(buf->dentry)->i_size To solve above issues, reference commit '5afedf670caf', call 'blk_trace_cleanup' unconditionally in '__blk_trace_remove' and first stop block trace in 'blk_trace_cleanup'. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019033602.752383-3-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20blktrace: introduce 'blk_trace_{start,stop}' helperYe Bin
Introduce 'blk_trace_{start,stop}' helper. No functional changed. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019033602.752383-2-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-18Merge tag 'for-netdev' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-18 We've added 33 non-merge commits during the last 14 day(s) which contain a total of 31 files changed, 874 insertions(+), 538 deletions(-). The main changes are: 1) Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs, from Hou Tao & Paul E. McKenney. 2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values. In the wild we have seen OS vendors doing buggy backports where helper call numbers mismatched. This is an attempt to make backports more foolproof, from Andrii Nakryiko. 3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions, from Roberto Sassu. 4) Fix libbpf's BTF dumper for structs with padding-only fields, from Eduard Zingerman. 5) Fix various libbpf bugs which have been found from fuzzing with malformed BPF object files, from Shung-Hsi Yu. 6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT, from Jie Meng. 7) Fix various ASAN bugs in both libbpf and selftests when running the BPF selftest suite on arm64, from Xu Kuohai. 8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest and use in-skeleton link pointer to remove an explicit bpf_link__destroy(), from Jiri Olsa. 9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying on symlinked iptables which got upgraded to iptables-nft, from Martin KaFai Lau. 10) Minor BPF selftest improvements all over the place, from various others. * tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits) bpf/docs: Update README for most recent vmtest.sh bpf: Use rcu_trace_implies_rcu_gp() for program array freeing bpf: Use rcu_trace_implies_rcu_gp() in local storage map bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator rcu-tasks: Provide rcu_trace_implies_rcu_gp() selftests/bpf: Use sys_pidfd_open() helper when possible libbpf: Fix null-pointer dereference in find_prog_by_sec_insn() libbpf: Deal with section with no data gracefully libbpf: Use elf_getshdrnum() instead of e_shnum selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow selftest/bpf: Fix memory leak in kprobe_multi_test selftests/bpf: Fix memory leak caused by not destroying skeleton libbpf: Fix memory leak in parse_usdt_arg() libbpf: Fix use-after-free in btf_dump_name_dups selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test selftests/bpf: Alphabetize DENYLISTs selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id() libbpf: Introduce bpf_link_get_fd_by_id_opts() libbpf: Introduce bpf_btf_get_fd_by_id_opts() ... ==================== Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18kcsan: Instrument memcpy/memset/memmove with newer ClangMarco Elver
With Clang version 16+, -fsanitize=thread will turn memcpy/memset/memmove calls in instrumented functions into __tsan_memcpy/__tsan_memset/__tsan_memmove calls respectively. Add these functions to the core KCSAN runtime, so that we (a) catch data races with mem* functions, and (b) won't run into linker errors with such newer compilers. Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcutorture: Verify NUM_ACTIVE_RCU_POLL_OLDSTATEPaul E. McKenney
This commit adds code to the RTWS_POLL_GET case of rcu_torture_writer() to verify that the value of NUM_ACTIVE_RCU_POLL_OLDSTATE is sufficiently large Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcutorture: Verify NUM_ACTIVE_RCU_POLL_FULL_OLDSTATEPaul E. McKenney
This commit adds code to the RTWS_POLL_GET_FULL case of rcu_torture_writer() to verify that the value of NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE is sufficiently large. [ paulmck: Fix whitespace issue located by checkpatch.pl. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Fix missing nocb gp wake on rcu_barrier()Frederic Weisbecker
In preparation for RCU lazy changes, wake up the RCU nocb gp thread if needed after an entrain. This change prevents the RCU barrier callback from waiting in the queue for several seconds before the lazy callbacks in front of it are serviced. Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Fix late wakeup when flush of bypass cblist happensJoel Fernandes (Google)
When the bypass cblist gets too big or its timeout has occurred, it is flushed into the main cblist. However, the bypass timer is still running and the behavior is that it would eventually expire and wake the GP thread. Since we are going to use the bypass cblist for lazy CBs, do the wakeup soon as the flush for "too big or too long" bypass list happens. Otherwise, long delays can happen for callbacks which get promoted from lazy to non-lazy. This is a good thing to do anyway (regardless of future lazy patches), since it makes the behavior consistent with behavior of other code paths where flushing into the ->cblist makes the GP kthread into a non-sleeping state quickly. [ Frederic Weisbecker: Changes to avoid unnecessary GP-thread wakeups plus comment changes. ] Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Simplify rcu_init_nohz() cpumask handlingZhen Lei
In kernels built with either CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y or CONFIG_NO_HZ_FULL=y, additional CPUs must be added to rcu_nocb_mask. Except that kernels booted without the rcu_nocbs= will not have allocated rcu_nocb_mask. And the current rcu_init_nohz() function uses its need_rcu_nocb_mask and offload_all local variables to track the rcu_nocb and nohz_full state. But there is a much simpler approach, namely creating a cpumask pointer to track the default and then using cpumask_available() to check the rcu_nocb_mask state. This commit takes this approach, thereby simplifying and shortening the rcu_init_nohz() function. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Use READ_ONCE() for lockless read of rnp->qsmaskJoel Fernandes (Google)
The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This may help avoid load tearing due to concurrent access, KCSAN issues, and preserve sanity of people reading the mask in tracing. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Synchronize ->qsmaskinitnext in rcu_boost_kthread_setaffinity()Pingfan Liu
Once either rcutree_online_cpu() or rcutree_dead_cpu() is invoked concurrently, the following rcu_boost_kthread_setaffinity() race can occur: CPU 1 CPU2 mask = rcu_rnp_online_cpus(rnp); ... mask = rcu_rnp_online_cpus(rnp); ... set_cpus_allowed_ptr(t, cm); set_cpus_allowed_ptr(t, cm); This results in CPU2's update being overwritten by that of CPU1, and thus the possibility of ->boost_kthread_task continuing to run on a to-be-offlined CPU. This commit therefore eliminates this race by relying on the pre-existing acquisition of ->boost_kthread_mutex to serialize the full process of changing the affinity of ->boost_kthread_task. Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Remove duplicate RCU exp QS report from rcu_report_dead()Zqiang
The rcu_report_dead() function invokes rcu_report_exp_rdp() in order to force an immediate expedited quiescent state on the outgoing CPU, and then it invokes rcu_preempt_deferred_qs() to provide any required deferred quiescent state of either sort. Because the call to rcu_preempt_deferred_qs() provides the expedited RCU quiescent state if requested, the call to rcu_report_exp_rdp() is potentially redundant. One possible issue is a concurrent start of a new expedited RCU grace period, but this situation is already handled correctly by __sync_rcu_exp_select_node_cpus(). This function will detect that the CPU is going offline via the error return from its call to smp_call_function_single(). In that case, it will retry, and eventually stop retrying due to rcu_report_exp_rdp() clearing the ->qsmaskinitnext bit corresponding to the target CPU. As a result, __sync_rcu_exp_select_node_cpus() will report the necessary quiescent state after dealing with any remaining CPU. This change assumes that control does not enter rcu_report_dead() within an RCU read-side critical section, but then again, the surviving call to rcu_preempt_deferred_qs() has always made this assumption. This commit therefore removes the call to rcu_report_exp_rdp(), thus relying on rcu_preempt_deferred_qs() to handle both normal and expedited quiescent states. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18srcu: Convert ->srcu_lock_count and ->srcu_unlock_count to atomicPaul E. McKenney
NMI-safe variants of srcu_read_lock() and srcu_read_unlock() are needed by printk(), which on many architectures entails read-modify-write atomic operations. This commit prepares Tree SRCU for this change by making both ->srcu_lock_count and ->srcu_unlock_count by atomic_long_t. [ paulmck: Apply feedback from John Ogness. ] Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>