path: root/kernel
Age | Commit message | Author
2021-09-16 | scftorture: Count reschedule IPIs | Paul E. McKenney
Currently, only those IPIs that invoke scftorture's scf_handler() IPI handler function are counted. This means that runs exercising only scftorture.weight_resched will look like they have made no forward progress, resulting in "GP HANG" complaints from the rcutorture scripting. This commit therefore increments the scf_invoked_count per-CPU counter immediately after calling resched_cpu(). Fixes: 1ac78b49d61d4 ("scftorture: Add an alternative IPI vector") Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-16 | scftorture: Account for weight_resched when checking for all zeroes | Paul E. McKenney
The "all zero weights makes no sense" error is emitted even when scftorture.weight_resched is non-zero because it was left out of the enclosing "if" condition. This commit adds it in. Fixes: 1ac78b49d61d4 ("scftorture: Add an alternative IPI vector") Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-16 | scftorture: Shut down if nonsensical arguments given | Paul E. McKenney
If (say) a 10-hour scftorture run is started, but the module parameters are so nonsensical that the run doesn't even start, then scftorture will wait the full ten hours when run built into a guest OS. This commit therefore shuts down the system in this case so that the error is reported immediately instead of ten hours hence. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-16 | scftorture: Allow zero weight to exclude an smp_call_function*() category | Paul E. McKenney
This commit reworks the weighting calculations to allow zero to be specified to disable a given weight. For example, specifying the scftorture.weight_resched=0 kernel boot parameter without specifying a non-zero value for any of the other scftorture.weight_* parameters would provide the default weights for the others, but would refrain from doing any resched-based IPIs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
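The weighting rule described above can be modeled in user space. This is a minimal sketch, not the scftorture code: the function name, the use of -1 as a "not specified" sentinel, and the default values passed in by the caller are all assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical model: w[i] < 0 means "not specified", 0 means "disabled",
 * and a positive value is an explicit weight.
 *
 * - If no weight is positive, every weight takes its default unless it
 *   was explicitly set to zero.
 * - Otherwise, unspecified weights are treated as zero.
 *
 * Returns 0 on success, or -1 if all resolved weights are zero.
 */
static int model_resolve_weights(int *w, const int *defaults, size_t n)
{
	size_t i;
	int any_positive = 0;
	long sum = 0;

	for (i = 0; i < n; i++)
		if (w[i] > 0)
			any_positive = 1;

	for (i = 0; i < n; i++) {
		if (!any_positive)
			w[i] = (w[i] == 0) ? 0 : defaults[i];
		else if (w[i] < 0)
			w[i] = 0;
		sum += w[i];
	}
	return sum > 0 ? 0 : -1;
}
```

With this model, passing a single explicit zero leaves the other categories at their defaults, which matches the weight_resched=0 example in the commit message.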
2021-09-16 | rcu: Avoid unneeded function call in rcu_read_unlock() | Waiman Long
Since commit aa40c138cc8f3 ("rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict() is invoked by the inlined rcu_read_unlock() function. However, rcu_read_unlock_strict() is an empty function in production kernels, which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n. There is a mention of rcu_read_unlock_strict() in the BPF verifier, but this is in a deny-list, meaning that BPF does not care whether rcu_read_unlock_strict() is ever called. This commit therefore provides a slight performance improvement by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the pointless call to an empty function. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
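The hoisting described above can be sketched in user-space C. The names below are hypothetical stand-ins; MODEL_STRICT_GRACE_PERIOD plays the role of the CONFIG_RCU_STRICT_GRACE_PERIOD check, and the call counter exists only so the effect can be observed.

```c
/* Stand-in for the build-time option; 0 models a production kernel. */
#ifndef MODEL_STRICT_GRACE_PERIOD
#define MODEL_STRICT_GRACE_PERIOD 0
#endif

static int model_strict_calls;	/* counts out-of-line calls, illustration only */

/* Out-of-line helper: only useful when strict grace periods are enabled. */
void model_read_unlock_strict(void)
{
	model_strict_calls++;
}

/*
 * Inlined fast path: the constant check lives in the caller, so the
 * compiler can discard the call entirely when the option is off.
 */
static inline void model_read_unlock(void)
{
	if (MODEL_STRICT_GRACE_PERIOD)
		model_read_unlock_strict();
}
```

Because the condition is a compile-time constant, an optimizing compiler is expected to remove the branch and the call when the option is disabled.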
2021-09-15 | rcu-tasks: Update comments to cond_resched_tasks_rcu_qs() | Paul E. McKenney
The cond_resched_rcu_qs() function no longer exists, despite being mentioned several times in kernel/rcu/tasks.h. This commit therefore updates it to the current cond_resched_tasks_rcu_qs(). Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader | Neeraj Upadhyay
The trc_wait_for_one_reader() function is called at multiple stages of the trace rcu-tasks GP function, rcu_tasks_wait_gp():
- First, it is called as part of the per-task function, rcu_tasks_trace_pertask(), for all non-idle tasks. As part of per-task processing, this function adds the task to the holdout list and, if the task is currently running on a CPU, sends an IPI to the task's CPU. The IPI handler takes action depending on whether the task is in a trace rcu-tasks read-side critical section:
  - a. If the task is in a trace rcu-tasks read-side critical section (t->trc_reader_nesting != 0), the IPI handler sets the task's ->trc_reader_special.b.need_qs, so that this task notifies exit from its outermost read-side critical section (by decrementing trc_n_readers_need_end) to the GP handling function. trc_wait_for_one_reader() also increments trc_n_readers_need_end, so that the trace rcu-tasks GP handler function waits for this task's read-side exit notification. The IPI handler also sets t->trc_reader_checked to true, so no further IPIs are sent for this task for this trace rcu-tasks grace period, and this task can be removed from the holdout list.
  - b. If the task is in the process of exiting its trace rcu-tasks read-side critical section (t->trc_reader_nesting < 0), defer this task's processing to future calls to trc_wait_for_one_reader().
  - c. If the task is not in an rcu-tasks read-side critical section (t->trc_reader_nesting == 0), ->trc_reader_checked is set for this task, so that this task is removed from the holdout list.
- Second, trc_wait_for_one_reader() is called as part of the post scan, in function rcu_tasks_trace_postscan(), for all idle tasks.
- Third, in function check_all_holdout_tasks_trace(), this function is called for each task in the holdout list, but only if there isn't a pending IPI for the task (->trc_ipi_to_cpu == -1). This function removes the task from the holdout list if the IPI handler has completed the required work, to ensure that the current trace rcu-tasks grace period either waits for this task, or this task is not in a trace rcu-tasks read-side critical section.
Now consider the scenario where smp_call_function_single() fails in the first case, inside rcu_tasks_trace_pertask(). In this case, ->trc_ipi_to_cpu is set to the current CPU for that task. This results in trc_wait_for_one_reader() getting skipped in the third case, inside check_all_holdout_tasks_trace(), for this task. This further results in ->trc_reader_checked never getting set for this task, and the task never getting removed from the holdout list. This can cause the current trace rcu-tasks grace period to stall.
Fix the above problem by resetting ->trc_ipi_to_cpu to -1 on smp_call_function_single() failure, so that future IPI calls can be sent for this task. Note that all three of the trc_wait_for_one_reader() function's callers (rcu_tasks_trace_pertask(), rcu_tasks_trace_postscan(), check_all_holdout_tasks_trace()) hold cpu_read_lock(). This means that smp_call_function_single() cannot race with CPU hotplug, and thus should never fail. Therefore, also add a warning in order to report any such failure in case smp_call_function_single() grows some other reason for failure. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
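The failure handling described in this commit can be illustrated with a small user-space model. Everything here is a simplified assumption: struct model_task keeps only the one field of interest, and the send callback stands in for smp_call_function_single().

```c
/* Hypothetical, simplified stand-in for the per-task state. */
struct model_task {
	int trc_ipi_to_cpu;	/* CPU with an IPI in flight, or -1 if none */
};

/* Callback that "sends" the IPI; returns 0 on success, nonzero on failure. */
typedef int (*model_send_ipi_fn)(int cpu);

/*
 * Mark the IPI as pending, try to send it, and undo the marking if the
 * send fails so that a later holdout scan retries this task.
 */
static int model_send_check_ipi(struct model_task *t, int cpu,
				model_send_ipi_fn send)
{
	int err;

	t->trc_ipi_to_cpu = cpu;
	err = send(cpu);
	if (err) {
		/* Without this reset, later scans would skip the task forever. */
		t->trc_ipi_to_cpu = -1;
	}
	return err;
}

/* Two trivial senders used to exercise both paths. */
static int model_ipi_fails(int cpu) { (void)cpu; return -1; }
static int model_ipi_ok(int cpu) { (void)cpu; return 0; }
```

A holdout scan that only processes tasks with trc_ipi_to_cpu == -1 will pick the task up again after a failed send, which is the property the fix restores.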
2021-09-15 | rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace | Neeraj Upadhyay
call_rcu_tasks_trace() does have read-side primitives - rcu_read_lock_trace() and rcu_read_unlock_trace(). Fix this information in the comments. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives | Neeraj Upadhyay
The RCU tasks rude variant does not check whether the current running context on a CPU is usermode. A read-side critical section ends on transition to usermode execution, by virtue of usermode execution being schedulable. Clarify this in comments for call_rcu_tasks_rude() and synchronize_rcu_tasks_rude(). Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace | Neeraj Upadhyay
Valid CPU numbers can be zero or greater, but the checks for ->trc_ipi_to_cpu and tick_nohz_full_cpu()'s argument are for strictly greater than. This commit therefore corrects the check for no_hz_full cpu in show_stalled_task_trace() so as to include cpu 0. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
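The off-by-one in the comparison is easy to see in isolation. The helper below is hypothetical and only assumes the convention mentioned above, where -1 means "no CPU".

```c
#include <stdbool.h>

/* Hypothetical helper: -1 means "no CPU"; any value >= 0 names a real CPU. */
static bool model_cpu_is_valid(int cpu)
{
	return cpu >= 0;	/* "cpu > 0" would wrongly reject CPU 0 */
}
```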
2021-09-15 | rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace | Neeraj Upadhyay
In check_all_holdout_tasks_trace(), firstreport is a pointer argument; so, check the dereferenced value, instead of checking the pointer. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
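The pointer-versus-value mistake can be shown with a generic sketch. The function name and the messages are invented for illustration and are not taken from kernel/rcu/tasks.h.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Print a header before the first report only. The flag belongs to the
 * caller, so the test must be on *firstreport, not on the pointer.
 * Returns the number of lines printed.
 */
static int model_report_holdout(bool *firstreport, const char *name)
{
	int lines = 0;

	/* "if (firstreport)" would be true for every valid pointer. */
	if (*firstreport) {
		puts("Tasks blocking the grace period:");
		*firstreport = false;
		lines++;
	}
	printf("  %s\n", name);
	return lines + 1;
}
```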
2021-09-15 | rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment | Neeraj Upadhyay
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop | Paul E. McKenney
Early in debugging, it made some sense to differentiate the first iteration from subsequent iterations, but now this just causes confusion. This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)" statement to the beginning of the "for" loop in rcu_tasks_kthread(). Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Fix s/instruction/instructions/ typo in comment | Paul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Remove second argument of rcu_read_unlock_trace_special() | Paul E. McKenney
The second argument of rcu_read_unlock_trace_special() is always zero. When called from exit_tasks_rcu_finish_trace(), it is the constant zero, and rcu_read_unlock_trace_special() doesn't get called from rcu_read_unlock_trace() unless the value of local variable "nesting" is zero because in that case the early return is taken instead. This commit therefore removes the "nesting" argument from the rcu_read_unlock_trace_special() function, substituting the constant zero within that function. This commit also adds a WARN_ON_ONCE() to rcu_read_lock_trace_held() in case non-zeroness some day appears. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Add trc_inspect_reader() checks for exiting critical section | Paul E. McKenney
Currently, trc_inspect_reader() treats a task exiting its RCU Tasks Trace read-side critical section the same as being within that critical section. However, this can fail because that task might have already checked its .need_qs field, which means that it might never decrement the all-important trc_n_readers_need_end counter. Of course, for that to happen, the task would need to never again execute an RCU Tasks Trace read-side critical section, but this really could happen if the system's last trampoline was removed. Note that exit from such a critical section cannot be treated as a quiescent state due to the possibility of nested critical sections. This means that if trc_inspect_reader() sees a negative nesting value, it must set up to try again later. This commit therefore ignores tasks that are exiting their RCU Tasks Trace read-side critical sections so that they will be rechecked later. [ paulmck: Apply feedback from Neeraj Upadhyay and Boqun Feng. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | rcu-tasks: Simplify trc_read_check_handler() atomic operations | Paul E. McKenney
Currently, trc_wait_for_one_reader() atomically increments the trc_n_readers_need_end counter before sending the IPI invoking trc_read_check_handler(). All failure paths out of trc_read_check_handler() and also from the smp_call_function_single() within trc_wait_for_one_reader() must carefully atomically decrement this counter. This is more complex than it needs to be. This commit therefore simplifies things and saves a few lines of code by dispensing with the atomic decrements in favor of having trc_read_check_handler() do the atomic increment only in the success case. In theory, this represents no change in functionality. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-15 | locking/rwbase: Take care of ordering guarantee for fastpath reader | Boqun Feng
Readers of rwbase can lock and unlock without taking any inner lock. If that happens, we need the ordering provided by atomic operations to satisfy the ordering semantics of lock/unlock. Without that, consider the following case:

  { X = 0 initially }

  CPU 0                                    CPU 1
  =====                                    =====
                                           rt_write_lock();
                                           X = 1
                                           rt_write_unlock():
                                             atomic_add(READER_BIAS - WRITER_BIAS, ->readers);
                                             // ->readers is READER_BIAS.
  rt_read_lock():
    if ((r = atomic_read(->readers)) < 0) // True
      atomic_try_cmpxchg(->readers, r, r + 1); // succeed.
    <acquire the read lock via fast path>

  r1 = X; // r1 may be 0, because nothing prevents the reordering
          // of "X=1" and atomic_add() on CPU 1.

Therefore audit every usage of atomic operations that may happen in a fast path, and add necessary barriers. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20210909110203.953991276@infradead.org
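The ordering requirement in this commit can be sketched with C11 atomics. This is a user-space model under stated assumptions, not the rwbase code: the MODEL_* constants, the unsigned counter (with the top bit tested in place of the "< 0" check), and the function names are all invented for illustration. The point it shows is the pairing of a release on the writer's unlock with an acquire on the reader's successful fast-path cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_READER_BIAS (1U << 31)	/* set: reader fast path is open */
#define MODEL_WRITER_BIAS (1U << 30)	/* set: a writer holds the lock */

/* The model starts out write-locked. */
static atomic_uint model_readers = MODEL_WRITER_BIAS;
static int model_X;			/* data protected by the lock */

/* Writer: update the data, then unlock with release ordering. */
static void model_writer_publish(int value)
{
	model_X = value;
	atomic_fetch_add_explicit(&model_readers,
				  MODEL_READER_BIAS - MODEL_WRITER_BIAS,
				  memory_order_release);
}

/* Reader fast path: acquire ordering on the successful cmpxchg. */
static bool model_read_trylock_fast(void)
{
	unsigned int r = atomic_load_explicit(&model_readers,
					      memory_order_relaxed);

	while (r & MODEL_READER_BIAS) {
		/* On failure, r is reloaded and the bias is rechecked. */
		if (atomic_compare_exchange_weak_explicit(&model_readers,
							  &r, r + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}
```

With relaxed ordering on both sides, a reader that wins the fast path would have no guarantee of seeing the writer's store to model_X, which is the r1 == 0 outcome in the diagram above.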
2021-09-15 | locking/rwbase: Extract __rwbase_write_trylock() | Peter Zijlstra
The code in rwbase_write_lock() is a little non-obvious vs the read+set 'trylock'; extract the sequence into a helper function to clarify the code. This also provides a single site to fix fast-path ordering. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/YUCq3L+u44NDieEJ@hirez.programming.kicks-ass.net
2021-09-15 | locking/rwbase: Properly match set_and_save_state() to restore_state() | Peter Zijlstra
Noticed while looking at the readers race. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20210909110203.828203010@infradead.org
2021-09-15 | events: Reuse value read using READ_ONCE instead of re-reading it | Baptiste Lepers
In perf_event_addr_filters_apply, the task associated with the event (event->ctx->task) is read using READ_ONCE at the beginning of the function, checked, and then re-read from event->ctx->task, voiding all guarantees of the checks. Reuse the value that was read by READ_ONCE to ensure the consistency of the task struct throughout the function. Fixes: 375637bc52495 ("perf/core: Introduce address range filtering") Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210906015310.12802-1-baptiste.lepers@gmail.com
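The pattern being fixed is "check a snapshot, then use the same snapshot". The sketch below is a user-space model with hypothetical types; MODEL_READ_ONCE is a simplified stand-in for the kernel macro and relies on the GCC/Clang __typeof__ extension.

```c
/* Hypothetical, simplified types for illustration. */
struct model_task { int pid; };
struct model_ctx { struct model_task *task; };

/* Simplified stand-in for READ_ONCE(): force a single load. */
#define MODEL_READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Sentinel meaning "the task is gone" in this model. */
#define MODEL_TASK_TOMBSTONE ((struct model_task *)-1L)

static int model_filters_apply(struct model_ctx *ctx)
{
	struct model_task *task = MODEL_READ_ONCE(ctx->task);

	if (task == MODEL_TASK_TOMBSTONE)
		return -1;

	/*
	 * Use the checked snapshot. Re-reading ctx->task here could
	 * observe a different value than the one that was just checked.
	 */
	return task ? task->pid : 0;
}
```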
2021-09-15 | locking/lockdep: Avoid RCU-induced noinstr fail | Peter Zijlstra
vmlinux.o: warning: objtool: look_up_lock_class()+0xc7: call to rcu_read_lock_any_held() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210624095148.311980536@infradead.org
2021-09-15 | PM: hibernate: Remove blk_status_to_errno in hib_wait_io | Falla Coulibaly
blk_status_to_errno doesn't appear to perform extra work besides converting blk_status_t to integer. This patch removes that unnecessary conversion as the return type of the function is blk_status_t. Signed-off-by: Falla Coulibaly <fallacoulibalyz@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-09-15 | PM: sleep: Do not assume that "mem" is always present | Florian Fainelli
An implementation of suspend_ops is allowed to reject the PM_SUSPEND_MEM suspend type from its ->valid() callback. We should not assume that it is always present, as this is not a correct reflection of what a firmware interface may support. Fixes: 406e79385f32 ("PM / sleep: System sleep state selection interface rework") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-09-14 | bpf: Support for new btf kind BTF_KIND_TAG | Yonghong Song
LLVM14 added support for a new C attribute ([1]):
  __attribute__((btf_tag("arbitrary_str")))
This attribute will be emitted to dwarf ([2]) and pahole will convert it to BTF. Or for bpf target, this attribute will be emitted to BTF directly ([3], [4]). The attribute is intended to provide additional information for
- struct/union type or struct/union member
- static/global variables
- static/global function or function parameter.
For linux kernel, the btf_tag can be applied in various places to specify user pointer, function pre- or post- condition, function allow/deny in certain context, etc. Such information will be encoded in vmlinux BTF and can be used by verifier. The btf_tag can also be applied to bpf programs to help global verifiable functions, e.g., specifying preconditions, etc. This patch added basic parsing and checking support in kernel for new BTF_KIND_TAG kind.
[1] https://reviews.llvm.org/D106614
[2] https://reviews.llvm.org/D106621
[3] https://reviews.llvm.org/D106622
[4] https://reviews.llvm.org/D109560
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210914223015.245546-1-yhs@fb.com
2021-09-14 | memblock: introduce saner 'memblock_free_ptr()' interface | Linus Torvalds
The boot-time allocation interface for memblock is a mess, with 'memblock_alloc()' returning a virtual pointer, but then you are supposed to free it with 'memblock_free()' that takes a _physical_ address. Not only is that all kinds of strange and illogical, but it actually causes bugs, when people then use it like a normal allocation function, and it fails spectacularly on a NULL pointer: https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/ or just random memory corruption if the debug checks don't catch it: https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/ I really don't want to apply patches that treat the symptoms, when the fundamental cause is this horribly confusing interface. I started out looking at just automating a sane replacement sequence, but because of this mix of virtual and physical addresses, and because people have used the "__pa()" macro that can take either a regular kernel pointer, or just the raw "unsigned long" address, it's all quite messy. So this just introduces a new saner interface for freeing a virtual address that was allocated using 'memblock_alloc()', and that was kept as a regular kernel pointer. And then it converts a couple of users that are obvious and easy to test, including the 'xbc_nodes' case in lib/bootconfig.c that caused problems. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed") Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
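The shape of such a pointer-based wrapper can be sketched in user space. This is a model only: model_pa() uses a made-up fixed offset in place of the kernel's "__pa()" translation, and model_memblock_free() merely records its arguments instead of releasing memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed virtual-to-physical offset, for illustration only. */
#define MODEL_PAGE_OFFSET ((uintptr_t)0x1000)

static uintptr_t model_last_freed_phys;
static size_t model_last_freed_size;
static int model_free_calls;

/* Stand-in for a virtual-to-physical address translation. */
static uintptr_t model_pa(const void *virt)
{
	return (uintptr_t)virt - MODEL_PAGE_OFFSET;
}

/* Stand-in for the physical-address free routine: records the request. */
static void model_memblock_free(uintptr_t phys, size_t size)
{
	model_last_freed_phys = phys;
	model_last_freed_size = size;
	model_free_calls++;
}

/*
 * Pointer-based wrapper: callers pass what the allocator returned, a NULL
 * pointer is ignored, and the translation happens in exactly one place.
 */
static void model_memblock_free_ptr(void *ptr, size_t size)
{
	if (ptr)
		model_memblock_free(model_pa(ptr), size);
}
```

Keeping the address translation inside the wrapper means callers never handle a physical address themselves, which is the confusion the commit message describes.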
2021-09-14 | bpf: Handle return value of BPF_PROG_TYPE_STRUCT_OPS prog | Hou Tao
Currently if a function ptr in struct_ops has a return value, its caller will get a random return value from it, because the return value of the related BPF_PROG_TYPE_STRUCT_OPS prog is just dropped. So add a new flag, BPF_TRAMP_F_RET_FENTRY_RET, to tell the bpf trampoline to save and return the return value of the struct_ops prog if ret_size of the function ptr is greater than 0. Also restrict the flag to be used alone. Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210914023351.3664499-1-houtao1@huawei.com
2021-09-14 | audit: Convert to SPDX identifier | Cai Huoqing
Use SPDX-License-Identifier instead of a verbose license text. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-14 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller
Daniel Borkmann says:
====================
pull-request: bpf 2021-09-14
The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 13 day(s) which contain a total of 18 files changed, 334 insertions(+), 193 deletions(-). The main changes are:
1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.
2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.
3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.
4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.
5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-13 | kcsan: selftest: Cleanup and add missing __init | Marco Elver
Make test_encode_decode() more readable and add missing __init. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: Move ctx to start of argument list | Marco Elver
It is clearer if ctx is at the start of the function argument list; it'll be more consistent when adding functions with varying arguments but all requiring ctx. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: Support reporting scoped read-write access type | Marco Elver
Support generating the string representation of scoped read-write accesses for completeness. They will become required in planned changes. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: Start stack trace with explicit location if provided | Marco Elver
If an explicit access address is set, as is done for scoped accesses, always start the stack trace from that location. get_stack_skipnr() is changed into sanitize_stack_entries(), which if given an address, scans the stack trace for a matching function and then replaces that entry with the explicitly provided address. The previous reports for scoped accesses were all over the place, which could be quite confusing. We now always point at the start of the scope. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: Save instruction pointer for scoped accesses | Marco Elver
Save the instruction pointer for scoped accesses, so that it becomes possible for the reporting code to construct more accurate stack traces that will show the start of the scope. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: Add ability to pass instruction pointer of access to reporting | Marco Elver
Add the ability to pass an explicitly set instruction pointer of access from check_access() all the way through to reporting, in preparation for using it in reporting. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: test: Fix flaky test case | Marco Elver
If CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n, then we may also see data races between the writers only. If we get unlucky and never capture a read-write data race, but only the write-write data races, then the test_no_value_change* test cases may incorrectly fail. The second problem is that the initial value needs to be reset, as otherwise we might actually observe a value change at the start. Fix it by also looking for the write-write data races, and resetting the value to what will be written. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: test: Use kunit_skip() to skip tests | Marco Elver
Use the new kunit_skip() to skip tests if requirements were not met. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | kcsan: test: Defer kcsan_test_init() after kunit initialization | Marco Elver
When the test is built into the kernel (not a module), kcsan_test_init() and kunit_init() both use late_initcall(), which means kcsan_test_init() might see a NULL debugfs_rootdir as parent dentry, resulting in kcsan_test_init() and kcsan_debugfs_init() both trying to create a debugfs node named "kcsan" in debugfs root. One of them will show an error and be unsuccessful. Defer kcsan_test_init() until we're sure kunit was initialized. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcutorture: Avoid problematic critical section nesting on PREEMPT_RT | Scott Wood
rcutorture is generating some nesting scenarios that are not compatible with PREEMPT_RT. For example:
  preempt_disable();
  rcu_read_lock_bh();
  preempt_enable();
  rcu_read_unlock_bh();
The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking and then continue with disabling preemption or interrupts. When unlocking, do it in reverse: first enable interrupts and preemption, and enable BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcutorture: Don't cpuhp_remove_state() if cpuhp_setup_state() failed | Paul E. McKenney
Currently, in CONFIG_RCU_BOOST kernels, if the rcu_torture_init() function's call to cpuhp_setup_state() fails, rcu_torture_cleanup() gamely passes nonsense to cpuhp_remove_state(). This results in strange and misleading splats. This commit therefore ensures that if the rcu_torture_init() function's call to cpuhp_setup_state() fails, rcu_torture_cleanup() avoids invoking cpuhp_remove_state(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
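The guard described above follows a common init/cleanup pattern: remember whether setup succeeded and only tear down what was actually set up. The sketch below is a user-space model; the setup and remove functions, the state number 7, and the error value -22 are hypothetical stand-ins rather than the cpuhp API.

```c
static int model_hp_state = -1;	/* negative: setup never succeeded */
static int model_remove_calls;	/* counts teardown calls, illustration only */

/* Stand-in for the setup call: returns a state id, or a negative error. */
static int model_setup_state(int fail)
{
	return fail ? -22 : 7;
}

/* Stand-in for the teardown call. */
static void model_remove_state(int state)
{
	(void)state;
	model_remove_calls++;
}

static int model_init(int fail)
{
	int ret = model_setup_state(fail);

	if (ret < 0)
		return ret;	/* model_hp_state stays negative */
	model_hp_state = ret;
	return 0;
}

static void model_cleanup(void)
{
	/* Only remove a state that was actually set up. */
	if (model_hp_state >= 0) {
		model_remove_state(model_hp_state);
		model_hp_state = -1;
	}
}
```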
2021-09-13 | rcuscale: Warn on individual rcu_scale_init() error conditions | Paul E. McKenney
When running rcuscale as a module, any rcu_scale_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running rcuscale built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing rcu_scale_init() errors when running rcuscale built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | refscale: Warn on individual ref_scale_init() error conditions | Paul E. McKenney
When running refscale as a module, any ref_scale_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running refscale built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing ref_scale_init() errors when running refscale built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | locktorture: Warn on individual lock_torture_init() error conditions | Paul E. McKenney
When running locktorture as a module, any lock_torture_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running locktorture built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing lock_torture_init() errors when running locktorture built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcutorture: Warn on individual rcu_torture_init() error conditions | Paul E. McKenney
When running rcutorture as a module, any rcu_torture_init() issues will be reflected in the error code from modprobe or insmod, as the case may be. However, these error codes are not available when running rcutorture built-in, for example, when using the kvm.sh script. This commit therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init() errors when running rcutorture built-in. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcutorture: Suppressing read-exit testing is not an error | Paul E. McKenney
Currently, specifying the rcutorture.read_exit_burst=0 kernel boot parameter will result in a -EINVAL exit code that will stop the rcutorture test run before it has fully initialized. This commit therefore uses a zero exit code in that case, thus allowing rcutorture.read_exit_burst=0 to complete normally. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode | Daniel Borkmann
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used. Back in the day, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") embedded per-socket cgroup information into sock->sk_cgrp_data and in order to save 8 bytes in struct sock made both mutually exclusive, that is, when cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2 falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp). The assumption made was "there is no reason to mix the two and this is in line with how legacy and v2 compatibility is handled" as stated in bd1060a1d671. However, with Kubernetes more widely supporting cgroups v2 as well nowadays, this assumption no longer holds, and the possibility of the v1/v2 mixed mode with the v2 root fallback being hit becomes a real security issue. Many of the cgroup v2 BPF programs are also used for policy enforcement, just to pick _one_ example, that is, to programmatically deny socket related system calls like connect(2) or bind(2). A v2 root fallback would implicitly cause a policy bypass for the affected Pods. In production environments, we have recently seen this case due to various circumstances: i) a different 3rd party agent and/or ii) a container runtime such as [0] in the user's environment configuring legacy cgroup v1 net_cls tags, which implicitly triggered the mentioned root fallback. Another case is Kubernetes projects like kind [1] which create Kubernetes nodes in a container and also add cgroup namespaces to the mix, meaning programs which are attached to the cgroup v2 root of the cgroup namespace get attached to a non-root cgroup v2 path from init namespace point of view. And the latter's root is out of reach for agents on a kind Kubernetes node to configure. Meaning, any entity on the node setting cgroup v1 net_cls tag will trigger the bypass despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user environments and makes cgroup v2 usage from BPF side fragile and unreliable. This fix adds proper struct cgroup pointer for the cgroup v2 case to struct sock_cgroup_data in order to address these issues; this implicitly also fixes the tradeoffs being made back then with regards to races and refcount leaks as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF programs always operate as expected. [0] https://github.com/nestybox/sysbox/ [1] https://kind.sigs.k8s.io/ Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
2021-09-13 | rcu-tasks: Wait for trc_read_check_handler() IPIs | Paul E. McKenney
Currently, RCU Tasks Trace initializes the trc_n_readers_need_end counter to the value one, increments it before each trc_read_check_handler() IPI, then decrements it within trc_read_check_handler() if the target task was in a quiescent state (or if the target task moved to some other CPU while the IPI was in flight), complaining if the new value was zero. The rationale for complaining is that the initial value of one must be decremented away before zero can be reached, and this decrement has not yet happened. Except that trc_read_check_handler() is initiated with an asynchronous smp_call_function_single(), which might be significantly delayed. This can result in false-positive complaints about the counter reaching zero. This commit therefore waits for in-flight IPI handlers to complete before decrementing away the initial value of one from the trc_n_readers_need_end counter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() | Neeraj Upadhyay
The sync_sched_exp_online_cleanup() checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. Before sending IPI, it checks whether expedited qs need has been already requested for the incoming CPU, by checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which sync_sched_exp_online_cleanup() is running. This works for the case where incoming CPU is same as self. However, for the case where incoming CPU is different from self, expedited request won't get marked, which can potentially delay reporting of expedited quiescent state for the incoming CPU. Fixes: e015a3411220 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcu: Make rcu update module parameters world-readable | Juri Lelli
rcu update module parameters currently don't appear in sysfs and this is a serviceability issue as it might be needed to access their default values at runtime. Fix this issue by changing rcu update module parameters permissions to world-readable. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-13 | rcu: Make rcu_normal_after_boot writable again | Juri Lelli
Certain configurations (e.g., systems that make heavy use of netns) need to use synchronize_rcu_expedited() to service RCU grace periods even after boot. Even though synchronize_rcu_expedited() has been traditionally considered harmful for RT for the heavy use of IPIs, it is perfectly usable under certain conditions (e.g. nohz_full). Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_FULL is defined), but keep its default value to 1 (enabled) to avoid regressions. Users who need synchronize_rcu_expedited() will boot with rcupdate.rcu_normal_after_boot=0 in the kernel cmdline. Reflect the change in synchronize_rcu_expedited_wait() by removing the WARN related to CONFIG_PREEMPT_RT. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>