summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-09-15PM: hibernate: Remove blk_status_to_errno in hib_wait_ioFalla Coulibaly
blk_status_to_errno doesn't appear to perform extra work besides converting blk_status_t to integer. This patch removes that unnecessary conversion as the return type of the function is blk_status_t. Signed-off-by: Falla Coulibaly <fallacoulibalyz@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-09-15PM: sleep: Do not assume that "mem" is always presentFlorian Fainelli
An implementation of suspend_ops is allowed to reject the PM_SUSPEND_MEM suspend type from its ->valid() callback, we should not assume that it is always present as this is not a correct reflection of what a firmware interface may support. Fixes: 406e79385f32 ("PM / sleep: System sleep state selection interface rework") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-09-14bpf: Support for new btf kind BTF_KIND_TAGYonghong Song
LLVM14 added support for a new C attribute ([1]) __attribute__((btf_tag("arbitrary_str"))) This attribute will be emitted to dwarf ([2]) and pahole will convert it to BTF. Or for bpf target, this attribute will be emitted to BTF directly ([3], [4]). The attribute is intended to provide additional information for - struct/union type or struct/union member - static/global variables - static/global function or function parameter. For linux kernel, the btf_tag can be applied in various places to specify user pointer, function pre- or post- condition, function allow/deny in certain context, etc. Such information will be encoded in vmlinux BTF and can be used by verifier. The btf_tag can also be applied to bpf programs to help global verifiable functions, e.g., specifying preconditions, etc. This patch added basic parsing and checking support in kernel for new BTF_KIND_TAG kind. [1] https://reviews.llvm.org/D106614 [2] https://reviews.llvm.org/D106621 [3] https://reviews.llvm.org/D106622 [4] https://reviews.llvm.org/D109560 Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210914223015.245546-1-yhs@fb.com
2021-09-14memblock: introduce saner 'memblock_free_ptr()' interfaceLinus Torvalds
The boot-time allocation interface for memblock is a mess, with 'memblock_alloc()' returning a virtual pointer, but then you are supposed to free it with 'memblock_free()' that takes a _physical_ address. Not only is that all kinds of strange and illogical, but it actually causes bugs, when people then use it like a normal allocation function, and it fails spectacularly on a NULL pointer: https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/ or just random memory corruption if the debug checks don't catch it: https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/ I really don't want to apply patches that treat the symptoms, when the fundamental cause is this horribly confusing interface. I started out looking at just automating a sane replacement sequence, but because of this mix or virtual and physical addresses, and because people have used the "__pa()" macro that can take either a regular kernel pointer, or just the raw "unsigned long" address, it's all quite messy. So this just introduces a new saner interface for freeing a virtual address that was allocated using 'memblock_alloc()', and that was kept as a regular kernel pointer. And then it converts a couple of users that are obvious and easy to test, including the 'xbc_nodes' case in lib/bootconfig.c that caused problems. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed") Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-14bpf: Handle return value of BPF_PROG_TYPE_STRUCT_OPS progHou Tao
Currently if a function ptr in struct_ops has a return value, its caller will get a random return value from it, because the return value of related BPF_PROG_TYPE_STRUCT_OPS prog is just dropped. So adding a new flag BPF_TRAMP_F_RET_FENTRY_RET to tell bpf trampoline to save and return the return value of struct_ops prog if ret_size of the function ptr is greater than 0. Also restricting the flag to be used alone. Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210914023351.3664499-1-houtao1@huawei.com
2021-09-14audit: Convert to SPDX identifierCai Huoqing
Use SPDX-License-Identifier instead of a verbose license text. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-09-14 The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 13 day(s) which contain a total of 18 files changed, 334 insertions(+), 193 deletions(-). The main changes are: 1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song. 2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann. 3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui. 4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker. 5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-13kcsan: selftest: Cleanup and add missing __initMarco Elver
Make test_encode_decode() more readable and add missing __init. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: Move ctx to start of argument listMarco Elver
It is clearer if ctx is at the start of the function argument list; it'll be more consistent when adding functions with varying arguments but all requiring ctx. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: Support reporting scoped read-write access typeMarco Elver
Support generating the string representation of scoped read-write accesses for completeness. They will become required in planned changes. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: Start stack trace with explicit location if providedMarco Elver
If an explicit access address is set, as is done for scoped accesses, always start the stack trace from that location. get_stack_skipnr() is changed into sanitize_stack_entries(), which if given an address, scans the stack trace for a matching function and then replaces that entry with the explicitly provided address. The previous reports for scoped accesses were all over the place, which could be quite confusing. We now always point at the start of the scope. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: Save instruction pointer for scoped accessesMarco Elver
Save the instruction pointer for scoped accesses, so that it becomes possible for the reporting code to construct more accurate stack traces that will show the start of the scope. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: Add ability to pass instruction pointer of access to reportingMarco Elver
Add the ability to pass an explicitly set instruction pointer of access from check_access() all the way through to reporting. In preparation of using it in reporting. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: test: Fix flaky test caseMarco Elver
If CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n, then we may also see data races between the writers only. If we get unlucky and never capture a read-write data race, but only the write-write data races, then the test_no_value_change* test cases may incorrectly fail. The second problem is that the initial value needs to be reset, as otherwise we might actually observe a value change at the start. Fix it by also looking for the write-write data races, and resetting the value to what will be written. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: test: Use kunit_skip() to skip testsMarco Elver
Use the new kunit_skip() to skip tests if requirements were not met. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13kcsan: test: Defer kcsan_test_init() after kunit initializationMarco Elver
When the test is built into the kernel (not a module), kcsan_test_init() and kunit_init() both use late_initcall(), which means kcsan_test_init() might see a NULL debugfs_rootdir as parent dentry, resulting in kcsan_test_init() and kcsan_debugfs_init() both trying to create a debugfs node named "kcsan" in debugfs root. One of them will show an error and be unsuccessful. Defer kcsan_test_init() until we're sure kunit was initialized. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcutorture: Avoid problematic critical section nesting on PREEMPT_RTScott Wood
rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT. For example: preempt_disable(); rcu_read_lock_bh(); preempt_enable(); rcu_read_unlock_bh(); The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking and continue with then with disabling preemption or interrupts. In the unlocking do it reverse by first enabling interrupts and preemption and BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcutorture: Don't cpuhp_remove_state() if cpuhp_setup_state() failedPaul E. McKenney
Currently, in CONFIG_RCU_BOOST kernels, if the rcu_torture_init() function's call to cpuhp_setup_state() fails, rcu_torture_cleanup() gamely passes nonsense to cpuhp_remove_state(). This results in strange and misleading splats. This commit therefore ensures that if the rcu_torture_init() function's call to cpuhp_setup_state() fails, rcu_torture_cleanup() avoids invoking cpuhp_remove_state(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcuscale: Warn on individual rcu_scale_init() error conditionsPaul E. McKenney
When running rcuscale as a module, any rcu_scale_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running rcuscale built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing rcu_scale_init() errors when running rcuscale built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13refscale: Warn on individual ref_scale_init() error conditionsPaul E. McKenney
When running refscale as a module, any ref_scale_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running refscale built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing ref_scale_init() errors when running refscale built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13locktorture: Warn on individual lock_torture_init() error conditionsPaul E. McKenney
When running locktorture as a module, any lock_torture_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running locktorture built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing lock_torture_init() errors when running locktorture built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcutorture: Warn on individual rcu_torture_init() error conditionsPaul E. McKenney
When running rcutorture as a module, any rcu_torture_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running rcutorture built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init() errors when running rcutorture built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcutorture: Suppressing read-exit testing is not an errorPaul E. McKenney
Currently, specifying the rcutorture.read_exit_burst=0 kernel boot parameter will result in a -EINVAL exit code that will stop the rcutorture test run before it has fully initialized. This commit therefore uses a zero exit code in that case, thus allowing rcutorture.read_exit_burst=0 to complete normally. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed modeDaniel Borkmann
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used. Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") embedded per-socket cgroup information into sock->sk_cgrp_data and in order to save 8 bytes in struct sock made both mutually exclusive, that is, when cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2 falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp). The assumption made was "there is no reason to mix the two and this is in line with how legacy and v2 compatibility is handled" as stated in bd1060a1d671. However, with Kubernetes more widely supporting cgroups v2 as well nowadays, this assumption no longer holds, and the possibility of the v1/v2 mixed mode with the v2 root fallback being hit becomes a real security issue. Many of the cgroup v2 BPF programs are also used for policy enforcement, just to pick _one_ example, that is, to programmatically deny socket related system calls like connect(2) or bind(2). A v2 root fallback would implicitly cause a policy bypass for the affected Pods. In production environments, we have recently seen this case due to various circumstances: i) a different 3rd party agent and/or ii) a container runtime such as [0] in the user's environment configuring legacy cgroup v1 net_cls tags, which triggered implicitly mentioned root fallback. Another case is Kubernetes projects like kind [1] which create Kubernetes nodes in a container and also add cgroup namespaces to the mix, meaning programs which are attached to the cgroup v2 root of the cgroup namespace get attached to a non-root cgroup v2 path from init namespace point of view. And the latter's root is out of reach for agents on a kind Kubernetes node to configure. Meaning, any entity on the node setting cgroup v1 net_cls tag will trigger the bypass despite cgroup v2 BPF programs attached to the namespace root. Generally, this mutual exclusiveness does not hold anymore in today's user environments and makes cgroup v2 usage from BPF side fragile and unreliable. This fix adds proper struct cgroup pointer for the cgroup v2 case to struct sock_cgroup_data in order to address these issues; this implicitly also fixes the tradeoffs being made back then with regards to races and refcount leaks as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF programs always operate as expected. [0] https://github.com/nestybox/sysbox/ [1] https://kind.sigs.k8s.io/ Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
2021-09-13rcu-tasks: Wait for trc_read_check_handler() IPIsPaul E. McKenney
Currently, RCU Tasks Trace initializes the trc_n_readers_need_end counter to the value one, increments it before each trc_read_check_handler() IPI, then decrements it within trc_read_check_handler() if the target task was in a quiescent state (or if the target task moved to some other CPU while the IPI was in flight), complaining if the new value was zero. The rationale for complaining is that the initial value of one must be decremented away before zero can be reached, and this decrement has not yet happened. Except that trc_read_check_handler() is initiated with an asynchronous smp_call_function_single(), which might be significantly delayed. This can result in false-positive complaints about the counter reaching zero. This commit therefore waits for in-flight IPI handlers to complete before decrementing away the initial value of one from the trc_n_readers_need_end counter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()Neeraj Upadhyay
The sync_sched_exp_online_cleanup() checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. Before sending IPI, it checks whether expedited qs need has been already requested for the incoming CPU, by checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which sync_sched_exp_online_cleanup() is running. This works for the case where incoming CPU is same as self. However, for the case where incoming CPU is different from self, expedited request won't get marked, which can potentially delay reporting of expedited quiescent state for the incoming CPU. Fixes: e015a3411220 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Make rcu update module parameters world-readableJuri Lelli
rcu update module parameters currently don't appear in sysfs and this is a serviceability issue as it might be needed to access their default values at runtime. Fix this issue by changing rcu update module parameters permissions to world-readable. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Make rcu_normal_after_boot writable againJuri Lelli
Certain configurations (e.g., systems that make heavy use of netns) need to use synchronize_rcu_expedited() to service RCU grace periods even after boot. Even though synchronize_rcu_expedited() has been traditionally considered harmful for RT for the heavy use of IPIs, it is perfectly usable under certain conditions (e.g. nohz_full). Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_ FULL is defined), but keep its default value to 1 (enabled) to avoid regressions. Users who need synchronize_rcu_expedited() will boot with rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline. Reflect the change in synchronize_rcu_expedited_wait() by removing the WARN related to CONFIG_PREEMPT_RT. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Make rcutree_dying_cpu() use its "cpu" parameterPaul E. McKenney
The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu() ignores it in favor of this_cpu_ptr(). This works at the moment, but it would be better to be consistent. This might also work better given some possible future changes. This commit therefore uses per_cpu_ptr() to avoid ignoring the rcutree_dying_cpu() function's argument. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Simplify rcu_report_dead() call to rcu_report_exp_rdp()Paul E. McKenney
Currently, rcu_report_dead() disables preemption across its call to rcu_report_exp_rdp(), but this is pointless because interrupts are already disabled by the caller. In addition, rcu_report_dead() computes the address of the outgoing CPU's rcu_data structure, which is also pointless because this address is already present in local variable rdp. This commit therefore drops the preemption disabling and passes rdp to rcu_report_exp_rdp(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()Paul E. McKenney
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks counter of an incoming CPU when required. It is currently invoked from rcutree_prepare_cpu(), which runs before the incoming CPU is running, and thus on some other CPU. This makes the per-CPU accesses in rcu_dynticks_eqs_online() iffy at best, and it all "works" only because the running CPU cannot possibly be in dyntick-idle mode, which means that rcu_dynticks_eqs_online() never has any effect. It is currently OK for rcu_dynticks_eqs_online() to have no effect, but only because the CPU-offline process just happens to leave ->dynticks in the correct state. After all, if ->dynticks were in the wrong state on a just-onlined CPU, rcutorture would complain bitterly the next time that CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y, for example, those built by rcutorture scenario TREE04. One could argue that this means that rcu_dynticks_eqs_online() is unnecessary, however, removing it would make the CPU-online process vulnerable to slight changes in the CPU-offline process. One could also ask why it is safe to move the rcu_dynticks_eqs_online() call so late in the CPU-online process. Indeed, there was a time when it would not have been safe, which does much to explain its current location. However, the marking of a CPU as online from an RCU perspective has long since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all that is required is that ->dynticks be set correctly by the time that the CPU is marked as online from an RCU perspective. After all, the RCU grace-period kthread does not check to see if offline CPUs are also idle. (In case you were curious, this is one reason why there is quiescent-state reporting as part of the offlining process.) This commit therefore moves the call to rcu_dynticks_eqs_online() from rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed to be running on the incoming CPU. The call to this function must of course be placed before this rcu_cpu_starting() announces this CPU's presence to RCU. Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Comment rcu_gp_init() code waiting for CPU-hotplug operationsPaul E. McKenney
Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits for CPU-hotplug operations that might have started before the new grace period did. This commit adds a comment explaining that this wait does not exclude CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Fix undefined Kconfig macrosZhouyi Zhou
Invoking scripts/checkkconfigsymbols.py in the Linux-kernel source tree located the following issues: 1. TREE_PREEMPT_RCU Referencing files: arch/sh/configs/sdk7786_defconfig It should now be CONFIG_PREEMPT_RCU. Except that the CONFIG_PREEMPT=y in that same file implies CONFIG_PREEMPT_RCU=y. Therefore, delete the CONFIG_TREE_PREEMPT_RCU=y line. The reason is as follows: In kernel/rcu/Kconfig, we have config PREEMPT_RCU bool default y if PREEMPTION https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt says, "The default value is only assigned to the config symbol if no other value was set by the user (via the input prompt above)." there is no prompt in config PREEMPT_RCU entry, so we are guaranteed to get CONFIG_PREEMPT_RCU=y when CONFIG_PREEMPT is present. 2. RCU_CPU_STALL_INFO Referencing files: arch/xtensa/configs/nommu_kc705_defconfig The old Kconfig option RCU_CPU_STALL_INFO was removed by commit 75c27f119b64 ("rcu: Remove CONFIG_RCU_CPU_STALL_INFO"), and the kernel now acts as if this Kconfig option was unconditionally enabled. 3. RCU_NOCB_CPU_ALL Referencing files: Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst This is an old snapshot of the code. I update this from the real rcu_prepare_for_idle() function in kernel/rcu/tree_plugin.h. This change was tested by invoking "make htmldocs". 4. RCU_TORTURE_TESTS Referencing files: kernel/rcu/rcutorture.c Forward-progress checking conflicts with CPU-stall testing, so we should complain at "modprobe rcutorture" when both are enabled. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Eliminate rcu_implicit_dynticks_qs() local variable ruqpPaul E. McKenney
The rcu_implicit_dynticks_qs() function's local variable ruqp references the ->rcu_urgent_qs field in the rcu_data structure referenced by the function parameter rdp, with a rather odd method for computing the pointer to this field. This commit therefore simplifies things and saves a couple of lines of code by replacing each instance of ruqp with &rdp->need_heavy_qs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Eliminate rcu_implicit_dynticks_qs() local variable rnhqpPaul E. McKenney
The rcu_implicit_dynticks_qs() function's local variable rnhqp references the ->rcu_need_heavy_qs field in the rcu_data structure referenced by the function parameter rdp, with a rather odd method for computing the pointer to this field. This commit therefore simplifies things and saves a few lines of code by replacing each instance of rnhqp with &rdp->need_heavy_qs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu-nocb: Fix a couple of tree_nocb code-style nitsPaul E. McKenney
This commit removes a non-value-returning "return" statement at the end of __call_rcu_nocb_wake() and adds a blank line following declarations in nocb_cb_can_run(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13rcu: Mark accesses to rcu_state.n_force_qsPaul E. McKenney
This commit marks accesses to the rcu_state.n_force_qs. These data races are hard to make happen, but syzkaller was equal to the task. Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com Acked-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13bpf: Add oversize check before call kvcalloc()Bixuan Cui
Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") add the oversize check. When the allocation is larger than what kmalloc() supports, the following warning triggered: WARNING: CPU: 0 PID: 8408 at mm/util.c:597 kvmalloc_node+0x108/0x110 mm/util.c:597 Modules linked in: CPU: 0 PID: 8408 Comm: syz-executor221 Not tainted 5.14.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:kvmalloc_node+0x108/0x110 mm/util.c:597 Call Trace: kvmalloc include/linux/mm.h:806 [inline] kvmalloc_array include/linux/mm.h:824 [inline] kvcalloc include/linux/mm.h:829 [inline] check_btf_line kernel/bpf/verifier.c:9925 [inline] check_btf_info kernel/bpf/verifier.c:10049 [inline] bpf_check+0xd634/0x150d0 kernel/bpf/verifier.c:13759 bpf_prog_load kernel/bpf/syscall.c:2301 [inline] __sys_bpf+0x11181/0x126e0 kernel/bpf/syscall.c:4587 __do_sys_bpf kernel/bpf/syscall.c:4691 [inline] __se_sys_bpf kernel/bpf/syscall.c:4689 [inline] __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:4689 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: syzbot+f3e749d4c662818ae439@syzkaller.appspotmail.com Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210911005557.45518-1-cuibixuan@huawei.com
2021-09-13libbpf: Make libbpf_version.h non-auto-generatedAndrii Nakryiko
Turn previously auto-generated libbpf_version.h header into a normal header file. This prevents various tricky Makefile integration issues, simplifies the overall build process, but also allows to further extend it with some more versioning-related APIs in the future. To prevent accidental out-of-sync versions as defined by libbpf.map and libbpf_version.h, Makefile checks their consistency at build time. Simultaneously with this change bump libbpf.map to v0.6. Also undo adding libbpf's output directory into include path for kernel/bpf/preload, bpftool, and resolve_btfids, which is not necessary because libbpf_version.h is just a normal header like any other. Fixes: 0b46b7550560 ("libbpf: Add LIBBPF_DEPRECATED_SINCE macro for scheduling API deprecations") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210913222309.3220849-1-andrii@kernel.org
2021-09-13audit: rename struct node to struct audit_node to prevent future name collisionsChristophe Leroy
Future work in the powerpc code results in a name collision with the identified "node" as struct node defined in kernel/audit_tree.c conflicts with struct node defined in include/linux/node.h (below). This patch takes the proactive route and renames the audit code's struct node to audit_node. CC kernel/audit_tree.o kernel/audit_tree.c:33:9: error: redefinition of 'struct node' 33 | struct node { | ^~~~ In file included from ./include/linux/cpu.h:17, from ./include/linux/static_call.h:102, from ./arch/powerpc/include/asm/machdep.h:10, from ./arch/powerpc/include/asm/archrandom.h:7, from ./include/linux/random.h:121, from ./include/linux/net.h:18, from ./include/linux/skbuff.h:26, from kernel/audit.h:11, from kernel/audit_tree.c:2: ./include/linux/node.h:84:8: note: originally defined here 84 | struct node { | ^~~~ make[2]: *** [kernel/audit_tree.o] Error 1 Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: rewrite subj/desc as the build failure is just a RFC patch] Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-13cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsemWaiman Long
Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still reference cpuset_mutex which are now incorrect. Change all the references of cpuset_mutex to cpuset_rwsem. Fixes: 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-09-13bpf: Introduce helper bpf_get_branch_snapshotSong Liu
Introduce bpf_get_branch_snapshot(), which allows tracing pogram to get branch trace from hardware (e.g. Intel LBR). To use the feature, the user need to create perf_event with proper branch_record filtering on each cpu, and then calls bpf_get_branch_snapshot in the bpf function. On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210910183352.3151445-3-songliubraving@fb.com
2021-09-13perf: Enable branch record for software eventsSong Liu
The typical way to access branch record (e.g. Intel LBR) is via hardware perf_event. For CPUs with FREEZE_LBRS_ON_PMI support, PMI could capture reliable LBR. On the other hand, LBR could also be useful in non-PMI scenario. For example, in kretprobe or bpf fexit program, LBR could provide a lot of information on what happened with the function. Add API to use branch record for software use. Note that, when the software event triggers, it is necessary to stop the branch record hardware asap. Therefore, static_call is used to remove some branch instructions in this process. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/20210910183352.3151445-2-songliubraving@fb.com
2021-09-13dma-debug: prevent an error message from causing runtime problemsHamza Mahfooz
For some drivers, that use the DMA API. This error message can be reached several millions of times per second, causing spam to the kernel's printk buffer and bringing the CPU usage up to 100% (so, it should be rate limited). However, since there is at least one driver that is in the mainline and suffers from the error condition, it is more useful to err_printk() here instead of just rate limiting the error message (in hopes that it will make it easier for other drivers that suffer from this issue to be spotted). Link: https://lkml.kernel.org/r/fd67fbac-64bf-f0ea-01e1-5938ccfab9d0@arm.com Reported-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-12Merge tag 'sched_urgent_for_v5.15_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the idle timer expires in hardirq context, on PREEMPT_RT - Make sure the run-queue balance callback is invoked only on the outgoing CPU * tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Prevent balance_push() on remote runqueues sched/idle: Make the idle timer expire in hard interrupt context
2021-09-12Merge tag 'locking_urgent_for_v5.15_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Borislav Petkov: - Fix the futex PI requeue machinery to not return to userspace in inconsistent state - Avoid a potential null pointer dereference in the ww_mutex deadlock check - Other smaller cleanups and optimizations * tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Fix ww_mutex deadlock check futex: Remove unused variable 'vpid' in futex_proxy_trylock_atomic() futex: Avoid redundant task lookup futex: Clarify comment for requeue_pi_wake_futex() futex: Prevent inconsistent state and exit race futex: Return error code instead of assigning it without effect locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT
2021-09-11Merge tag 'trace-v5.15-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Minor fixes to the processing of the bootconfig tree" * tag 'trace-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: Rename xbc_node_find_child() to xbc_node_find_subkey() tracing/boot: Fix to check the histogram control param is a leaf node tracing/boot: Fix trace_boot_hist_add_array() to check array is value
2021-09-10bpf, mm: Fix lockdep warning triggered by stack_map_get_build_id_offset()Yonghong Song
Currently the bpf selftest "get_stack_raw_tp" triggered the warning: [ 1411.304463] WARNING: CPU: 3 PID: 140 at include/linux/mmap_lock.h:164 find_vma+0x47/0xa0 [ 1411.304469] Modules linked in: bpf_testmod(O) [last unloaded: bpf_testmod] [ 1411.304476] CPU: 3 PID: 140 Comm: systemd-journal Tainted: G W O 5.14.0+ #53 [ 1411.304479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1411.304481] RIP: 0010:find_vma+0x47/0xa0 [ 1411.304484] Code: de 48 89 ef e8 ba f5 fe ff 48 85 c0 74 2e 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 2d 9f d8 00 85 c0 75 d4 <0f> 0b 48 89 de 48 8 [ 1411.304487] RSP: 0018:ffffabd440403db8 EFLAGS: 00010246 [ 1411.304490] RAX: 0000000000000000 RBX: 00007f00ad80a0e0 RCX: 0000000000000000 [ 1411.304492] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e [ 1411.304494] RBP: ffff9cf5c2f50000 R08: ffff9cf5c3eb25d8 R09: 00000000fffffffe [ 1411.304496] R10: 0000000000000001 R11: 00000000ef974e19 R12: ffff9cf5c39ae0e0 [ 1411.304498] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9cf5c39ae0e0 [ 1411.304501] FS: 00007f00ae754780(0000) GS:ffff9cf5fba00000(0000) knlGS:0000000000000000 [ 1411.304504] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1411.304506] CR2: 000000003e34343c CR3: 0000000103a98005 CR4: 0000000000370ee0 [ 1411.304508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1411.304510] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1411.304512] Call Trace: [ 1411.304517] stack_map_get_build_id_offset+0x17c/0x260 [ 1411.304528] __bpf_get_stack+0x18f/0x230 [ 1411.304541] bpf_get_stack_raw_tp+0x5a/0x70 [ 1411.305752] RAX: 0000000000000000 RBX: 5541f689495641d7 RCX: 0000000000000000 [ 1411.305756] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e [ 1411.305758] RBP: ffff9cf5c02b2f40 R08: ffff9cf5ca7606c0 R09: ffffcbd43ee02c04 [ 1411.306978] bpf_prog_32007c34f7726d29_bpf_prog1+0xaf/0xd9c [ 1411.307861] R10: 0000000000000001 R11: 0000000000000044 R12: ffff9cf5c2ef60e0 [ 1411.307865] R13: 0000000000000005 R14: 0000000000000000 R15: ffff9cf5c2ef6108 [ 1411.309074] bpf_trace_run2+0x8f/0x1a0 [ 1411.309891] FS: 00007ff485141700(0000) GS:ffff9cf5fae00000(0000) knlGS:0000000000000000 [ 1411.309896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1411.311221] syscall_trace_enter.isra.20+0x161/0x1f0 [ 1411.311600] CR2: 00007ff48514d90e CR3: 0000000107114001 CR4: 0000000000370ef0 [ 1411.312291] do_syscall_64+0x15/0x80 [ 1411.312941] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1411.313803] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1411.314223] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1411.315082] RIP: 0033:0x7f00ad80a0e0 [ 1411.315626] Call Trace: [ 1411.315632] stack_map_get_build_id_offset+0x17c/0x260 To reproduce, first build `test_progs` binary: make -C tools/testing/selftests/bpf -j60 and then run the binary at tools/testing/selftests/bpf directory: ./test_progs -t get_stack_raw_tp The warning is due to commit 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()") which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function asserts that mm->mmap_lock needs to be held. But this is not the case for bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held in bpf_get_stack[id]() use case, the above warning is emitted during test run. This patch fixed the issue by (1). using mmap_read_trylock() instead of mmap_read_trylock_non_owner() to satisfy lockdep checking in find_vma(), and (2). droping lockdep for mmap_lock right before the irq_work_queue(). The function mmap_read_trylock_non_owner() is also removed since after this patch nobody calls it any more. Fixes: 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()") Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Luigi Rizzo <lrizzo@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/bpf/20210909155000.1610299-1-yhs@fb.com
2021-09-09bootconfig: Rename xbc_node_find_child() to xbc_node_find_subkey()Masami Hiramatsu
Rename xbc_node_find_child() to xbc_node_find_subkey() for clarifying that function returns a key node (no value node). Since there are xbc_node_for_each_child() (loop on all child nodes) and xbc_node_for_each_subkey() (loop on only subkey nodes), this name distinction is necessary to avoid confusing users. Link: https://lkml.kernel.org/r/163119459826.161018.11200274779483115300.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-09tracing/boot: Fix to check the histogram control param is a leaf nodeMasami Hiramatsu
Since xbc_node_find_child() doesn't ensure the returned node is a leaf node (key-value pair or do not have subkeys), use xbc_node_find_value to ensure the histogram control parameter is a leaf node in trace_boot_compose_hist_cmd(). Link: https://lkml.kernel.org/r/163119459059.161018.18341288218424528962.stgit@devnote2 Fixes: e66ed86ca6c5 ("tracing/boot: Add per-event histogram action options") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>