summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-02-01rcutorture: Fix rcu_fwd_mutex deadlockPaul E. McKenney
The rcu_torture_fwd_cb_hist() function acquires rcu_fwd_mutex, but is invoked from rcutorture_oom_notify() function, which hold this same mutex across this call. This commit fixes the resulting deadlock. Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcutorture: Add end-of-test check to rcu_torture_fwd_prog() loopPaul E. McKenney
The second and subsequent forward-progress kthreads loop waiting for the first forward-progress kthread to start the next test interval. Unfortunately, if the test ends while one of those kthreads is waiting, the test will hang. This hang occurs because that wait loop fails to check for the end of the test. This commit therefore adds an end-of-test check to that wait loop. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcutorture: Make rcu_fwd_cb_nodelay be a counterPaul E. McKenney
Back when only one rcutorture kthread could do forward-progress testing, it was just fine for rcu_fwd_cb_nodelay to be a non-atomic bool. It was set at the start of forward-progress testing and cleared at the end. But now that there are multiple threads, the value can be cleared while one of the threads is still doing forward-progress testing. This commit therefore makes rcu_fwd_cb_nodelay be an atomic counter, replacing the WRITE_ONCE() operations with atomic_inc() and atomic_dec(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcutorture: Increase visibility of forward-progress hangsPaul E. McKenney
This commit adds a few pr_alert() calls to rcutorture's forward-progress testing in order to better diagnose shutdown-time hangs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01torture: Distinguish kthread stopping and being asked to stopPaul E. McKenney
Right now, if a given kthread (call it "kthread") realizes that it needs to stop, "Stopping kthread" is written to the console. When the cleanup code decides that it is time to stop that kthread, "Stopping kthread tasks" is written to the console. These two events might happen in either order, especially in the case of time-based torture-test shutdown. But it is hard to distinguish these, especially for those unfamiliar with the torture tests. This commit therefore changes the first case from "Stopping kthread" to "kthread is stopping" to make things more clear. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcutorture: Print message before invoking ->cb_barrier()Paul E. McKenney
The various ->cb_barrier() functions, for example, rcu_barrier(), sometimes cause rcutorture hangs. But currently, the last console message is the unenlightening "Stopping rcu_torture_stats". This commit therefore prints a message of the form "rcu_torture_cleanup: Invoking rcu_barrier+0x0/0x1e0()" to help point people in the right direction. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Add per-CPU rcuc task dumps to RCU CPU stall warningsZqiang
When the rcutree.use_softirq kernel boot parameter is set to zero, all RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads. If these kthreads are being starved, quiescent states will not be reported, which in turn means that the grace period will not end, which can in turn trigger RCU CPU stall warnings. This commit therefore dumps stack traces of stalled CPUs' rcuc kthreads, which can help identify what is preventing those kthreads from running. Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Don't deboost before reporting expedited quiescent statePaul E. McKenney
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx before reporting the expedited quiescent state. Under heavy real-time load, this can result in this function being preempted before the quiescent state is reported, which can in turn prevent the expedited grace period from completing. Tim Murray reports that the resulting expedited grace periods can take hundreds of milliseconds and even more than one second, when they should normally complete in less than a millisecond. This was fine given that there were no particular response-time constraints for synchronize_rcu_expedited(), as it was designed for throughput rather than latency. However, some users now need sub-100-millisecond response-time constratints. This patch therefore follows Neeraj's suggestion (seconded by Tim and by Uladzislau Rezki) of simply reversing the two operations. Reported-by: Tim Murray <timmurray@google.com> Reported-by: Joel Fernandes <joelaf@google.com> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Tim Murray <timmurray@google.com> Cc: Todd Kjos <tkjos@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Elevate priority of offloaded callback threadsAlison Chaiken
When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line parameter signals initialization code to boost the priority of rcuc callbacks to the designated value. With the additional CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs command-line parameter, the callbacks on the listed cores are offloaded to new rcuop kthreads that are not pinned to the cores whose post-grace-period work is performed. While the rcuop kthreads perform the same function as the rcuc kthreads they offload, the kthread_prio parameter only boosts the priority of the rcuc kthreads. Fix this inconsistency by elevating rcuop kthreads to the same priority as the rcuc kthreads. Signed-off-by: Alison Chaiken <achaiken@aurora.tech> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Make priority of grace-period thread consistentAlison Chaiken
The priority of RCU grace period threads is set to kthread_prio when they are launched from rcu_spawn_gp_kthread(). The same is not true of rcu_spawn_one_nocb_kthread(). Accordingly, add priority elevation to rcu_spawn_one_nocb_kthread(). Signed-off-by: Alison Chaiken <achaiken@aurora.tech> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Move kthread_prio bounds-check to a separate functionAlison Chaiken
Move the bounds-check of the kthread_prio cmdline parameter to a new function in order to faciliate a different callsite. Signed-off-by: Alison Chaiken <achaiken@aurora.tech> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Create per-cpu rcuc kthreads only when rcutree.use_softirq=0Zqiang
The per-CPU "rcuc" kthreads are used only by kernels booted with rcutree.use_softirq=0, but they are nevertheless unconditionally created by kernels built with CONFIG_RCU_BOOST=y. This results in "rcuc" kthreads being created that are never actually used. This commit therefore refrains from creating these kthreads unless the kernel is actually booted with rcutree.use_softirq=0. Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Remove unused rcu_state.boostNeeraj Upadhyay
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu/nocb: Handle concurrent nocb kthreads creationNeeraj Upadhyay
When multiple CPUs in the same nocb gp/cb group concurrently come online, they might try to concurrently create the same rcuog kthread. Fix this by using nocb gp CPU's spawn mutex to provide mutual exclusion for the rcuog kthread creation code. [ paulmck: Whitespace fixes per kernel test robot feedback. ] Acked-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Mark accesses to boost_starttimePaul E. McKenney
The boost_starttime shared variable has conflicting unmarked C-language accesses, which are dangerous at best. This commit therefore adds appropriate marking. This was found by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu: Mark ->expmask access in synchronize_rcu_expedited_wait()Paul E. McKenney
This commit adds a READ_ONCE() to an access to the rcu_node structure's ->expmask field to prevent compiler mischief. Detected by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01rcu/exp: Fix check for idle context in rcu_exp_handlerNeeraj Upadhyay
For PREEMPT_RCU, the rcu_exp_handler() function checks whether the current CPU is in idle, by calling rcu_dynticks_curr_cpu_in_eqs(). However, rcu_exp_handler() is called in IPI handler context. So, it should be checking the idle context using rcu_is_cpu_rrupt_from_idle(). Fix this by using rcu_is_cpu_rrupt_from_idle() instead of rcu_dynticks_curr_cpu_in_eqs(). Non-preempt configuration already uses the correct check. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01bpf: Drop libbpf, libelf, libz dependency from bpf preload.Alexei Starovoitov
Drop libbpf, libelf, libz dependency from bpf preload. This reduces bpf_preload_umd binary size from 1.7M to 30k unstripped with debug info and from 300k to 19k stripped. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220131220528.98088-8-alexei.starovoitov@gmail.com
2022-02-01bpf: Open code obj_get_info_by_fd in bpf preload.Alexei Starovoitov
Open code obj_get_info_by_fd in bpf preload. It's the last part of libbpf that preload/iterators were using. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220131220528.98088-7-alexei.starovoitov@gmail.com
2022-02-01bpf: Convert bpf preload to light skeleton.Alexei Starovoitov
Convert bpffs preload iterators to light skeleton. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220131220528.98088-6-alexei.starovoitov@gmail.com
2022-02-01bpf: Remove unnecessary setrlimit from bpf preload.Alexei Starovoitov
BPF programs and maps are memcg accounted. setrlimit is obsolete. Remove its use from bpf preload. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220131220528.98088-5-alexei.starovoitov@gmail.com
2022-02-01Merge tag 'audit-pr-20220131' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "A single audit patch to fix problems relating to audit queuing and system responsiveness when "audit=1" is specified on the kernel command line and the audit daemon is SIGSTOP'd for an extended period of time" * tag 'audit-pr-20220131' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: improve audit queue handling when "audit=1" on cmdline
2022-02-01cgroup-v1: Require capabilities to set release_agentEric W. Biederman
The cgroup release_agent is called with call_usermodehelper. The function call_usermodehelper starts the release_agent with a full set fo capabilities. Therefore require capabilities when setting the release_agaent. Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com> Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com> Fixes: 81a6a5cdd2c5 ("Task Control Groups: automatic userspace notification of idle cgroups") Cc: stable@vger.kernel.org # v2.6.24+ Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-31bpf: make bpf_copy_from_user_task() gpl onlyKenta Tada
access_process_vm() is exported by EXPORT_SYMBOL_GPL(). Signed-off-by: Kenta Tada <Kenta.Tada@sony.com> Link: https://lore.kernel.org/r/20220128170906.21154-1-Kenta.Tada@sony.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31padata: replace cpumask_weight with cpumask_empty in padata.cYury Norov
padata_do_parallel() calls cpumask_weight() to check if any bit of a given cpumask is set. We can do it more efficiently with cpumask_empty() because cpumask_empty() stops traversing the cpumask as soon as it finds first set bit, while cpumask_weight() counts all bits unconditionally. Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-01-30Merge tag 'perf_urgent_for_v5.17_rc2_p2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Prevent accesses to the per-CPU cgroup context list from another CPU except the one it belongs to, to avoid list corruption - Make sure parent events are always woken up to avoid indefinite hangs in the traced workload * tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix cgroup event list management perf: Always wake the parent event
2022-01-30Merge tag 'sched_urgent_for_v5.17_rc2_p2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "Make sure the membarrier-rseq fence commands are part of the reported set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY" * tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
2022-01-30psi: fix "defined but not used" warnings when CONFIG_PROC_FS=nSuren Baghdasaryan
When CONFIG_PROC_FS is disabled psi code generates the following warnings: kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=] 1364 | static const struct proc_ops psi_cpu_proc_ops = { | ^~~~~~~~~~~~~~~~ kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=] 1355 | static const struct proc_ops psi_memory_proc_ops = { | ^~~~~~~~~~~~~~~~~~~ kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=] 1346 | static const struct proc_ops psi_io_proc_ops = { | ^~~~~~~~~~~~~~~ Make definitions of these structures and related functions conditional on CONFIG_PROC_FS config. Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com Fixes: 0e94682b73bf ("psi: introduce psi monitor") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-28Merge tag 'pm-5.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These make the buffer handling in pm_show_wakelocks() more robust and drop an unused hibernation-related function. Specifics: - Make the buffer handling in pm_show_wakelocks() more robust by using sysfs_emit_at() in it to generate output (Greg Kroah-Hartman). - Drop register_nosave_region_late() which is not used (Amadeusz Sławiński)" * tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: Remove register_nosave_region_late() PM: wakeup: simplify the output logic of pm_show_wakelocks()
2022-01-28Merge tag 'trace-v5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pulltracing fixes from Steven Rostedt: - Limit mcount build time sorting to only those archs that we know it works for. - Fix memory leak in error path of histogram setup - Fix and clean up rel_loc array out of bounds issue - tools/rtla documentation fixes - Fix issues with histogram logic * tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Don't inc err_log entry count if entry allocation fails tracing: Propagate is_signed to expression tracing: Fix smatch warning for do while check in event_hist_trigger_parse() tracing: Fix smatch warning for null glob in event_hist_trigger_parse() tools/tracing: Update Makefile to build rtla rtla: Make doc build optional tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro tracing: Avoid -Warray-bounds warning for __rel_loc macro tracing/histogram: Fix a potential memory leak for kstrdup() ftrace: Have architectures opt-in for mcount build time sorting
2022-01-28Merge branch 'ucount-rlimit-fixes-for-v5.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucount rlimit fix from Eric Biederman. Make sure the ucounts have a reference to the user namespace it refers to, so that users that themselves don't carry such a reference around can safely use the ucount functions. * 'ucount-rlimit-fixes-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucount: Make get_ucount a safe get_user replacement
2022-01-28Merge tag 'rcu-urgent.2022.01.26a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This fixes a brown-paper-bag bug in RCU tasks that causes things like BPF and ftrace to fail miserably on systems with non-power-of-two numbers of CPUs. It fixes a math error added in 7a30871b6a27 ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection') during the v5.17 merge window. This commit works correctly only on systems with a power-of-two number of CPUs, which just so happens to be the kind that rcutorture always uses by default. This pull request fixes the math so that things also work on systems that don't happen to have a power-of-two number of CPUs" * tag 'rcu-urgent.2022.01.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu-tasks: Fix computation of CPU-to-list shift counts
2022-01-27tracing: Don't inc err_log entry count if entry allocation failsTom Zanussi
tr->n_err_log_entries should only be increased if entry allocation succeeds. Doing it when it fails won't cause any problems other than wasting an entry, but should be fixed anyway. Link: https://lkml.kernel.org/r/cad1ab28f75968db0f466925e7cba5970cec6c29.1643319703.git.zanussi@kernel.org Cc: stable@vger.kernel.org Fixes: 2f754e771b1a6 ("tracing: Don't inc err_log entry count if entry allocation fails") Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-01-27tracing: Propagate is_signed to expressionTom Zanussi
During expression parsing, a new expression field is created which should inherit the properties of the operands, such as size and is_signed. is_signed propagation was missing, causing spurious errors with signed operands. Add it in parse_expr() and parse_unary() to fix the problem. Link: https://lkml.kernel.org/r/f4dac08742fd7a0920bf80a73c6c44042f5eaa40.1643319703.git.zanussi@kernel.org Cc: stable@vger.kernel.org Fixes: 100719dcef447 ("tracing: Add simple expression support to hist triggers") Reported-by: Yordan Karadzhov <ykaradzhov@vmware.com> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215513 Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-01-27tracing: Fix smatch warning for do while check in event_hist_trigger_parse()Tom Zanussi
The patch ec5ce0987541: "tracing: Allow whitespace to surround hist trigger filter" from Jan 15, 2018, leads to the following Smatch static checker warning: kernel/trace/trace_events_hist.c:6199 event_hist_trigger_parse() warn: 'p' can't be NULL. Since p is always checked for a NULL value at the top of loop and nothing in the rest of the loop will set it to NULL, the warning is correct and might as well be 1 to silence the warning. Link: https://lkml.kernel.org/r/a1d4c79766c0cf61e20438dc35244d216633fef6.1643319703.git.zanussi@kernel.org Fixes: ec5ce09875410 ("tracing: Allow whitespace to surround hist trigger filter") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-01-27tracing: Fix smatch warning for null glob in event_hist_trigger_parse()Tom Zanussi
The recent rename of event_hist_trigger_parse() caused smatch re-evaluation of trace_events_hist.c and as a result an old warning was found: kernel/trace/trace_events_hist.c:6174 event_hist_trigger_parse() error: we previously assumed 'glob' could be null (see line 6166) glob should never be null (and apparently smatch can also figure that out and skip the warning when using the cross-function DB (but which can't be used with a 0day build as it takes too much time to generate)). Nonetheless for clarity, remove the test but add a WARN_ON() in case the code ever changes. Link: https://lkml.kernel.org/r/96925e5c1f116654ada7ea0613d930b1266b5e1c.1643319703.git.zanussi@kernel.org Fixes: f404da6e1d46c ("tracing: Add 'last error' error facility for hist triggers") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-01-27tracing/histogram: Fix a potential memory leak for kstrdup()Xiaoke Wang
kfree() is missing on an error path to free the memory allocated by kstrdup(): p = param = kstrdup(data->params[i], GFP_KERNEL); So it is better to free it via kfree(p). Link: https://lkml.kernel.org/r/tencent_C52895FD37802832A3E5B272D05008866F0A@qq.com Cc: stable@vger.kernel.org Fixes: d380dcde9a07c ("tracing: Fix now invalid var_ref_vals assumption in trace action") Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-01-27ftrace: Have architectures opt-in for mcount build time sortingSteven Rostedt (Google)
First S390 complained that the sorting of the mcount sections at build time caused the kernel to crash on their architecture. Now PowerPC is complaining about it too. And also ARM64 appears to be having issues. It may be necessary to also update the relocation table for the values in the mcount table. Not only do we have to sort the table, but also update the relocations that may be applied to the items in the table. If the system is not relocatable, then it is fine to sort, but if it is, some architectures may have issues (although x86 does not as it shifts all addresses the same). Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is safe to do the sorting at build time. Also update the config to compile in build time sorting in the sorttable code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT. Link: https://lore.kernel.org/all/944D10DA-8200-4BA9-8D0A-3BED9AA99F82@linux.ibm.com/ Link: https://lkml.kernel.org/r/20220127153821.3bc1ac6e@gandalf.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Yinan Liu <yinan@linux.alibaba.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Kees Cook <keescook@chromium.org> Reported-by: Sachin Sant <sachinp@linux.ibm.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] Tested-by: Sachin Sant <sachinp@linux.ibm.com> Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in ftrace_init") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27bpf: reject program if a __user tagged memory accessed in kernel wayYonghong Song
BPF verifier supports direct memory access for BPF_PROG_TYPE_TRACING type of bpf programs, e.g., a->b. If "a" is a pointer pointing to kernel memory, bpf verifier will allow user to write code in C like a->b and the verifier will translate it to a kernel load properly. If "a" is a pointer to user memory, it is expected that bpf developer should be bpf_probe_read_user() helper to get the value a->b. Without utilizing BTF __user tagging information, current verifier will assume that a->b is a kernel memory access and this may generate incorrect result. Now BTF contains __user information, it can check whether the pointer points to a user memory or not. If it is, the verifier can reject the program and force users to use bpf_probe_read_user() helper explicitly. In the future, we can easily extend btf_add_space for other address space tagging, for example, rcu/percpu etc. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220127154606.654961-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27cgroup/bpf: fast path skb BPF filteringPavel Begunkov
Even though there is a static key protecting from overhead from cgroup-bpf skb filtering when there is nothing attached, in many cases it's not enough as registering a filter for one type will ruin the fast path for all others. It's observed in production servers I've looked at but also in laptops, where registration is done during init by systemd or something else. Add a per-socket fast path check guarding from such overhead. This affects both receive and transmit paths of TCP, UDP and other protocols. It showed ~1% tx/s improvement in small payload UDP send benchmarks using a real NIC and in a server environment and the number jumps to 2-3% for preemtible kernels. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27psi: Fix "defined but not used" warnings when CONFIG_PROC_FS=nSuren Baghdasaryan
When CONFIG_PROC_FS is disabled psi code generates the following warnings: kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=] 1364 | static const struct proc_ops psi_cpu_proc_ops = { | ^~~~~~~~~~~~~~~~ kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=] 1355 | static const struct proc_ops psi_memory_proc_ops = { | ^~~~~~~~~~~~~~~~~~~ kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=] 1346 | static const struct proc_ops psi_io_proc_ops = { | ^~~~~~~~~~~~~~~ Make definitions of these structures and related functions conditional on CONFIG_PROC_FS config. Fixes: 0e94682b73bf ("psi: introduce psi monitor") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
2022-01-27sched/uclamp: Fix iowait boost escaping uclamp restrictionQais Yousef
iowait_boost signal is applied independently of util and doesn't take into account uclamp settings of the rq. An io heavy task that is capped by uclamp_max could still request higher frequency because sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with() like effective_cpu_util() does. Make sure that iowait_boost honours uclamp requests by calling uclamp_rq_util_with() when applying the boost. Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com
2022-01-27sched/sugov: Ignore 'busy' filter when rq is capped by uclamp_maxQais Yousef
sugov_update_single_{freq, perf}() contains a 'busy' filter that ensures we don't bring the frqeuency down if there's no idle time (CPU is busy). The problem is that with uclamp_max we will have scenarios where a busy task is capped to run at a lower frequency and this filter prevents applying the capping when this task starts running. We handle this by skipping the filter when uclamp is enabled and the rq is being capped by uclamp_max. We introduce a new function uclamp_rq_is_capped() to help detecting when this capping is taking effect. Some code shuffling was required to allow using cpu_util_{cfs, rt}() in this new function. On 2 Core SMT2 Intel laptop I see: Without this patch: uclampset -M 0 sysbench --test=cpu --threads = 4 run produces a score of ~3200 consistently. Which is the highest possible. Compiling the kernel also results in frequency running at max 3.1GHz all the time - running uclampset -M 400 to cap it has no effect without this patch. With this patch: uclampset -M 0 sysbench --test=cpu --threads = 4 run produces a score of ~1100 with some outliers in ~1700. Uclamp max aggregates the performance requirements, so having high values sometimes is expected if some other task happens to require that frequency starts running at the same time. When compiling the kernel with uclampset -M 400 I can see the frequencies mostly in the ~2GHz region. Helpful to conserve power and prevent heating when not plugged in. Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211216225320.2957053-2-qais.yousef@arm.com
2022-01-27sched/core: Export pelt_thermal_tpQais Yousef
We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com
2022-01-27sched/numa: initialize numa statistics when forking new taskHonglei Wang
The child processes will inherit numa_pages_migrated and total_numa_faults from the parent. It means even if there is no numa fault happen on the child, the statistics in /proc/$pid of the child process might show huge amount. This is a bit weird. Let's initialize them when do fork. Signed-off-by: Honglei Wang <wanghonglei@didichuxing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220113133920.49900-1-wanghonglei@didichuxing.com
2022-01-27sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numaBharata B Rao
The older format of /proc/pid/sched printed home node info which required the mempolicy and task lock around mpol_get(). However the format has changed since then and there is no need for sched_show_numa() any more to have mempolicy argument, asssociated mpol_get/put and task_lock/unlock. Remove them. Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com
2022-01-26ucount: Make get_ucount a safe get_user replacementEric W. Biederman
When the ucount code was refactored to create get_ucount it was missed that some of the contexts in which a rlimit is kept elevated can be the only reference to the user/ucount in the system. Ordinary ucount references exist in places that also have a reference to the user namspace, but in POSIX message queues, the SysV shm code, and the SIGPENDING code there is no independent user namespace reference. Inspection of the the user_namespace show no instance of circular references between struct ucounts and the user_namespace. So hold a reference from struct ucount to i's user_namespace to resolve this problem. Link: https://lore.kernel.org/lkml/YZV7Z+yXbsx9p3JN@fixkernel.com/ Reported-by: Qian Cai <quic_qiancai@quicinc.com> Reported-by: Mathias Krause <minipli@grsecurity.net> Tested-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Alexey Gladkov <legion@kernel.org> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts") Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-01-26rcu-tasks: Fix computation of CPU-to-list shift countsPaul E. McKenney
The ->percpu_enqueue_shift field is used to map from the running CPU number to the index of the corresponding callback list. This mapping can change at runtime in response to varying callback load, resulting in varying levels of contention on the callback-list locks. Unfortunately, the initial value of this field is correct only if the system happens to have a power-of-two number of CPUs, otherwise the callbacks from the high-numbered CPUs can be placed into the callback list indexed by 1 (rather than 0), and those index-1 callbacks will be ignored. This can result in soft lockups and hangs. This commit therefore corrects this mapping, adding one to this shift count as needed for systems having odd numbers of CPUs. Fixes: 7a30871b6a27 ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection") Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Reported-by: Martin Lau <kafai@fb.com> Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-01-26cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()Tianchen Ding
subparts_cpus should be limited as a subset of cpus_allowed, but it is updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to fix it. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>