summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-08-24scftorture: Add smp_call_function_many() memory-ordering checksPaul E. McKenney
This commit adds checks for memory misordering across calls to and returns from smp_call_function_many() in the case where the caller waits. Misordering results in a splat. Note that in contrast to smp_call_function_single(), this code does not test memory ordering into the handler in the no-wait case because none of the handlers would be able to free the scf_check structure without introducing heavy synchronization to work out which was last. [ paulmck: s/GFP_KERNEL/GFP_ATOMIC/ per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24scftorture: Add smp_call_function_single() memory-ordering checksPaul E. McKenney
This commit adds checks for memory misordering across calls to smp_call_function_single() and also across returns in the case where the caller waits. Misordering results in a splat. [ paulmck: s/GFP_KERNEL/GFP_ATOMIC/ per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24scftorture: Summarize per-thread statisticsPaul E. McKenney
This commit summarizes the per-thread statistics, providing counts of the number of single, many, and all calls, both no-wait and wait, and, for the single case, the number where the target CPU was offline. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24tick-sched: Clarify "NOHZ: local_softirq_pending" warningPaul E. McKenney
Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH" (where "HH" is the hexadecimal softirq vector number) when one or more non-RCU softirq handlers are still enabled when checking to stop the scheduler-tick interrupt. This message is not as enlightening as one might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #HH". Reported-by: Andy Lutomirski <luto@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24scftorture: Implement weighted primitive selectionPaul E. McKenney
This commit uses the scftorture.weight* kernel parameters to randomly chooses between smp_call_function_single(), smp_call_function_many(), and smp_call_function(). For each variant, it also randomly chooses whether to invoke it synchronously (wait=1) or asynchronously (wait=0). The percentage weighting for each option are dumped to the console log (search for "scf_sel_dump"). This accumulates statistics, which a later commit will dump out at the end of the run. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24scftorture: Add smp_call_function() torture testPaul E. McKenney
This commit adds an smp_call_function() torture test that repeatedly invokes this function and complains if things go badly awry. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Remove unused __rcu_is_watching() functionPaul E. McKenney
The x86/entry work removed all uses of __rcu_is_watching(), therefore this commit removes it entirely. Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Make FQS more aggressive in complaining about offline CPUsJoel Fernandes (Google)
The RCU grace-period kthread's force-quiescent state (FQS) loop should never see an offline CPU that has not yet reported a quiescent state. After all, the offline CPU should have reported a quiescent state during the CPU-offline process, or, failing that, by rcu_gp_init() if it ran concurrently with either the CPU going offline or the last task on a leaf rcu_node structure exiting its RCU read-side critical section while all CPUs corresponding to that structure are offline. The FQS loop should therefore complain if it does see an offline CPU that has not yet reported a quiescent state. And it does, but only once the grace period has been in force for a full second. This commit therefore makes this warning more aggressive, so that it will trigger as soon as the condition makes its appearance. Light testing with TREE03 and hotplug shows no warnings. This commit also converts the warning to WARN_ON_ONCE() in order to stave off possible log spam. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Clarify comments about FQS loop reporting quiescent statesJoel Fernandes (Google)
Since at least v4.19, the FQS loop no longer reports quiescent states for offline CPUs except in emergency situations. This commit therefore fixes the comment in rcu_gp_init() to match the current code. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu/nocb: Add a warning for non-GP kthread running GP codePaul E. McKenney
This commit increases RCU's ability to defend itself by emitting a warning if one of the nocb CB kthreads invokes the GP kthread's wait function. This warning augments a similar check that is carried out at the end of rcutorture testing and when RCU CPU stall warnings are emitted. The problem with those checks is that the miscreants have long since departed and disposed of any and all evidence. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Move rcu_cpu_started per-CPU variable to rcu_dataPaul E. McKenney
When the rcu_cpu_started per-CPU variable was added by commit f64c6013a202 ("rcu/x86: Provide early rcu_cpu_starting() callback"), there were multiple sets of per-CPU rcu_data structures. Therefore, the rcu_cpu_started flag was added as a separate per-CPU variable. But now there is only one set of per-CPU rcu_data structures, so this commit moves rcu_cpu_started to a new ->cpu_started field in that structure. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_cpu_stall_ftrace_dumpPaul E. McKenney
Given that sysfs can change the value of rcu_cpu_stall_ftrace_dump at any time, this commit adds a READ_ONCE() to the accesses to that variable. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_kick_kthreadsPaul E. McKenney
Given that sysfs can change the value of rcu_kick_kthreads at any time, this commit adds a READ_ONCE() to the sole access to that variable. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_resched_nsPaul E. McKenney
Given that sysfs can change the value of rcu_resched_ns at any time, this commit adds a READ_ONCE() to the sole access to that variable. While in the area, this commit also adds bounds checking, clamping the value to at least a millisecond, but no longer than a second. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisorPaul E. McKenney
Given that sysfs can change the value of rcu_divisor at any time, this commit adds a READ_ONCE to the sole access to that variable. While in the area, this commit also adds bounds checking, clamping the value to a shift that makes sense for a signed long. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24nocb: Remove show_rcu_nocb_state() false positive printoutPaul E. McKenney
The rcu_data structure's ->nocb_timer field is used to defer wakeups of the corresponding no-CBs CPU's grace-period kthread ("rcuog*"), and that structure's ->nocb_defer_wakeup field is used to track such deferral. This means that the show_rcu_nocb_state() printing an error when those fields are set for a CPU not corresponding to a no-CBs grace-period kthread is erroneous. This commit therefore switches the check from ->nocb_timer to ->nocb_bypass_timer and removes the check of ->nocb_defer_wakeup. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp()Neeraj Upadhyay
Originally, the call to rcu_preempt_blocked_readers_cgp() from force_qs_rnp() had to be conditioned on CONFIG_PREEMPT_RCU=y, as in commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU hotplug"). However, there is now a CONFIG_PREEMPT_RCU=n definition of rcu_preempt_blocked_readers_cgp() that unconditionally returns zero, so invoking it is now safe. In addition, the CONFIG_PREEMPT_RCU=n definition of rcu_initiate_boost() simply releases the rcu_node structure's ->lock, which is what happens when the "if" condition evaluates to false. This commit therefore drops the IS_ENABLED(CONFIG_PREEMPT_RCU) check, so that rcu_initiate_boost() is called only in CONFIG_PREEMPT_RCU=y kernels when there are readers blocking the current grace period. This does not change the behavior, but reduces code-reader confusion by eliminating non-CONFIG_PREEMPT_RCU=y calls to rcu_initiate_boost(). Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu/tree: Force quiescent state on callback overloadNeeraj Upadhyay
On callback overload, it is necessary to quickly detect idle CPUs, and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately, the code following the call to this function does not repeat this check, which means that in reality no actual quiescent-state forcing, instead only a couple of quick and pointless wakeups at the beginning of the grace period. This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in the post-wakeup "if" statement in rcu_gp_fqs_loop(). Fixes: 1fca4d12f4637 ("rcu: Expedite first two FQS scans under callback-overload conditions") Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24nocb: Clarify RCU nocb CPU error messagePaul E. McKenney
A message of the form "rcu: !!! lDTs ." can be tracked down, but doing so is not trivial. This commit therefore eases this process by adding text so that this error message now reads as follows: "rcu: nocb GP activity on CB-only CPU!!! lDTs ." Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu/trace: Use gp_seq_req in acceleration's rcu_grace_period tracepointJoel Fernandes (Google)
During acceleration of CB, the rsp's gp_seq is rcu_seq_snap'd. This is the value used for acceleration - it is the value of gp_seq at which it is safe the execute all callbacks in the callback list. The rdp's gp_seq is not very useful for this scenario. Make rcu_grace_period report the gp_seq_req instead as it allows one to reason about how the acceleration works. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Initialize at declaration time in rcu_exp_handler()Paul E. McKenney
This commit moves the initialization of the CONFIG_PREEMPT=n version of the rcu_exp_handler() function's rdp and rnp local variables into their respective declarations to save a couple lines of code. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24srcu: Remove KCSAN stubsPaul E. McKenney
KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Remove KCSAN stubs from update.cPaul E. McKenney
KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Remove KCSAN stubsPaul E. McKenney
KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Optimize debugfs stats countersMarco Elver
Remove kcsan_counter_inc/dec() functions, as they perform no other logic, and are no longer needed. This avoids several calls in kcsan_setup_watchpoint() and kcsan_found_watchpoint(), as well as lets the compiler warn us about potential out-of-bounds accesses as the array's size is known at all usage sites at compile-time. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Use pr_fmt for consistencyMarco Elver
Use the same pr_fmt throughout for consistency. [ The only exception is report.c, where the format must be kept precisely as-is. ] Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Show message if enabled earlyMarco Elver
Show a message in the kernel log if KCSAN was enabled early. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Remove debugfs test commandMarco Elver
Remove the debugfs test command, as it is no longer needed now that we have the KUnit+Torture based kcsan-test module. This is to avoid confusion around how KCSAN should be tested, as only the kcsan-test module is maintained. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Simplify constant string handlingMarco Elver
Simplify checking prefixes and length calculation of constant strings. For the former, the kernel provides str_has_prefix(), and the latter we should just use strlen("..") because GCC and Clang have optimizations that optimize these into constants. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Simplify debugfs counter to name mappingMarco Elver
Simplify counter ID to name mapping by using an array with designated inits. This way, we can turn a run-time BUG() into a compile-time static assertion failure if a counter name is missing. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Test support for compound instrumentationMarco Elver
Changes kcsan-test module to support checking reports that include compound instrumentation. Since we should not fail the test if this support is unavailable, we have to add a config variable that the test can use to decide what to check for. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Add missing CONFIG_KCSAN_IGNORE_ATOMICS checksMarco Elver
Add missing CONFIG_KCSAN_IGNORE_ATOMICS checks for the builtin atomics instrumentation. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Skew delay to be longer for certain access typesMarco Elver
For compound instrumentation and assert accesses, skew the watchpoint delay to be longer if randomized. This is useful to improve race detection for such accesses. For compound accesses we should increase the delay as we've aggregated both read and write instrumentation. By giving up 1 call into the runtime, we're less likely to set up a watchpoint and thus less likely to detect a race. We can balance this by increasing the watchpoint delay. For assert accesses, we know these are of increased interest, and we wish to increase our chances of detecting races for such checks. Note that, kcsan_udelay_{task,interrupt} define the upper bound delays. When randomized, delays are uniformly distributed between [0, delay]. Skewing the delay does not break this promise as long as the defined upper bounds are still adhered to. The current skew results in delays uniformly distributed between [delay/2, delay]. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Support compounded read-write instrumentationMarco Elver
Add support for compounded read-write instrumentation if supported by the compiler. Adds the necessary instrumentation functions, and a new type which is used to generate a more descriptive report. Furthermore, such compounded memory access instrumentation is excluded from the "assume aligned writes up to word size are atomic" rule, because we cannot assume that the compiler emits code that is atomic for compound ops. LLVM/Clang added support for the feature in: https://github.com/llvm/llvm-project/commit/785d41a261d136b64ab6c15c5d35f2adc5ad53e3 The new instrumentation is emitted for sets of memory accesses in the same basic block to the same address with at least one read appearing before a write. These typically result from compound operations such as ++, --, +=, -=, |=, &=, etc. but also equivalent forms such as "var = var + 1". Where the compiler determines that it is equivalent to emit a call to a single __tsan_read_write instead of separate __tsan_read and __tsan_write, we can then benefit from improved performance and better reporting for such access patterns. The new reports now show that the ops are both reads and writes, for example: read-write to 0xffffffff90548a38 of 8 bytes by task 143 on cpu 3: test_kernel_rmw_array+0x45/0xa0 access_thread+0x71/0xb0 kthread+0x21e/0x240 ret_from_fork+0x22/0x30 read-write to 0xffffffff90548a38 of 8 bytes by task 144 on cpu 2: test_kernel_rmw_array+0x45/0xa0 access_thread+0x71/0xb0 kthread+0x21e/0x240 ret_from_fork+0x22/0x30 Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Add atomic builtin test caseMarco Elver
Adds test case to kcsan-test module, to test atomic builtin instrumentation works. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Add support for atomic builtinsMarco Elver
Some architectures (currently e.g. s390 partially) implement atomics using the compiler's atomic builtins (__atomic_*, __sync_*). To support enabling KCSAN on such architectures in future, or support experimental use of these builtins, implement support for them. We should also avoid breaking KCSAN kernels due to use (accidental or otherwise) of atomic builtins in drivers, as has happened in the past: https://lkml.kernel.org/r/5231d2c0-41d9-6721-e15f-a7eedf3ce69e@infradead.org The instrumentation is subtly different from regular reads/writes: TSAN instrumentation replaces the use of atomic builtins with a call into the runtime, and the runtime's job is to also execute the desired atomic operation. We rely on the __atomic_* compiler builtins, available with all KCSAN-supported compilers, to implement each TSAN atomic instrumentation function. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23Merge tag 'core-urgent-2020-08-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull entry fix from Thomas Gleixner: "A single bug fix for the common entry code. The transcription of the x86 version messed up the reload of the syscall number from pt_regs after ptrace and seccomp which breaks syscall number rewriting" * tag 'core-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: core/entry: Respect syscall number rewrites
2020-08-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "Nothing earth shattering here, lots of small fixes (f.e. missing RCU protection, bad ref counting, missing memset(), etc.) all over the place: 1) Use get_file_rcu() in task_file iterator, from Yonghong Song. 2) There are two ways to set remote source MAC addresses in macvlan driver, but only one of which validates things properly. Fix this. From Alvin Šipraga. 3) Missing of_node_put() in gianfar probing, from Sumera Priyadarsini. 4) Preserve device wanted feature bits across multiple netlink ethtool requests, from Maxim Mikityanskiy. 5) Fix rcu_sched stall in task and task_file bpf iterators, from Yonghong Song. 6) Avoid reset after device destroy in ena driver, from Shay Agroskin. 7) Missing memset() in netlink policy export reallocation path, from Johannes Berg. 8) Fix info leak in __smc_diag_dump(), from Peilin Ye. 9) Decapsulate ECN properly for ipv6 in ipv4 tunnels, from Mark Tomlinson. 10) Fix number of data stream negotiation in SCTP, from David Laight. 11) Fix double free in connection tracker action module, from Alaa Hleihel. 12) Don't allow empty NHA_GROUP attributes, from Nikolay Aleksandrov" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits) net: nexthop: don't allow empty NHA_GROUP bpf: Fix two typos in uapi/linux/bpf.h net: dsa: b53: check for timeout tipc: call rcu_read_lock() in tipc_aead_encrypt_done() net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow net: sctp: Fix negotiation of the number of data streams. dt-bindings: net: renesas, ether: Improve schema validation gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit() hv_netvsc: Remove "unlikely" from netvsc_select_queue bpf: selftests: global_funcs: Check err_str before strstr bpf: xdp: Fix XDP mode when no mode flags specified selftests/bpf: Remove test_align leftovers tools/resolve_btfids: Fix sections with wrong alignment net/smc: Prevent kernel-infoleak in __smc_diag_dump() sfc: fix build warnings on 32-bit net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified" libbpf: Fix map index used in error message net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe() net: atlantic: Use readx_poll_timeout() for large timeout ...
2020-08-23timekeeping: Provide multi-timestamp accessor to NMI safe timekeeperThomas Gleixner
printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to make correlation of dmesg from several systems easier. Provide an interface to retrieve all three timestamps in one go. There are some caveats: 1) Boot time and late sleep time injection Boot time is a racy access on 32bit systems if the sleep time injection happens late during resume and not in timekeeping_resume(). That could be avoided by expanding struct tk_read_base with boot offset for 32bit and adding more overhead to the update. As this is a hard to observe once per resume event which can be filtered with reasonable effort using the accurate mono/real timestamps, it's probably not worth the trouble. Aside of that it might be possible on 32 and 64 bit to observe the following when the sleep time injection happens late: CPU 0 CPU 1 timekeeping_resume() ktime_get_fast_timestamps() mono, real = __ktime_get_real_fast() inject_sleep_time() update boot offset boot = mono + bootoffset; That means that boot time already has the sleep time adjustment, but real time does not. On the next readout both are in sync again. Preventing this for 64bit is not really feasible without destroying the careful cache layout of the timekeeper because the sequence count and struct tk_read_base would then need two cache lines instead of one. 2) Suspend/resume timestamps Access to the time keeper clock source is disabled accross the innermost steps of suspend/resume. The accessors still work, but the timestamps are frozen until time keeping is resumed which happens very early. For regular suspend/resume there is no observable difference vs. sched clock, but it might affect some of the nasty low level debug printks. OTOH, access to sched clock is not guaranteed accross suspend/resume on all systems either so it depends on the hardware in use. If that turns out to be a real problem then this could be mitigated by using sched clock in a similar way as during early boot. But it's not as trivial as on early boot because it needs some careful protection against the clock monotonic timestamp jumping backwards on resume. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de
2020-08-23timekeeping: Utilize local_clock() for NMI safe timekeeper during early bootThomas Gleixner
During early boot the NMI safe timekeeper returns 0 until the first clocksource becomes available. This prevents it from being used for printk or other facilities which today use sched clock. sched clock can be available way before timekeeping is initialized. The obvious workaround for this is to utilize the early sched clock in the default dummy clock read function until a clocksource becomes available. After switching to the clocksource clock MONOTONIC and BOOTTIME will not jump because the timekeeping_init() bases clock MONOTONIC on sched clock and the offset between clock MONOTONIC and BOOTTIME is zero during boot. Clock REALTIME cannot provide useful timestamps during early boot up to the point where a persistent clock becomes available, which is either in timekeeping_init() or later when the RTC driver which might depend on I2C or other subsystems is initialized. There is a minor difference to sched_clock() vs. suspend/resume. As the timekeeper clock source might not be accessible during suspend, after timekeeping_suspend() timestamps freeze up to the point where timekeeping_resume() is invoked. OTOH this is true for some sched clock implementations as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200814115512.041422402@linutronix.de
2020-08-21bpf: sockmap: Allow update from BPFLorenz Bauer
Allow calling bpf_map_update_elem on sockmap and sockhash from a BPF context. The synchronization required for this is a bit fiddly: we need to prevent the socket from changing its state while we add it to the sockmap, since we rely on getting a callback via sk_prot->unhash. However, we can't just lock_sock like in sock_map_sk_acquire because that might sleep. So instead we disable softirq processing and use bh_lock_sock to prevent further modification. Yet, this is still not enough. BPF can be called in contexts where the current CPU might have locked a socket. If the BPF can get a hold of such a socket, inserting it into a sockmap would lead to a deadlock. One straight forward example are sock_ops programs that have ctx->sk, but the same problem exists for kprobes, etc. We deal with this by allowing sockmap updates only from known safe contexts. Improper usage is rejected by the verifier. I've audited the enabled contexts to make sure they can't run in a locked context. It's possible that CGROUP_SKB and others are safe as well, but the auditing here is much more difficult. In any case, we can extend the safe contexts when the need arises. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200821102948.21918-6-lmb@cloudflare.com
2020-08-21bpf: Override the meaning of ARG_PTR_TO_MAP_VALUE for sockmap and sockhashLorenz Bauer
The verifier assumes that map values are simple blobs of memory, and therefore treats ARG_PTR_TO_MAP_VALUE, etc. as such. However, there are map types where this isn't true. For example, sockmap and sockhash store sockets. In general this isn't a big problem: we can just write helpers that explicitly requests PTR_TO_SOCKET instead of ARG_PTR_TO_MAP_VALUE. The one exception are the standard map helpers like map_update_elem, map_lookup_elem, etc. Here it would be nice we could overload the function prototype for different kinds of maps. Unfortunately, this isn't entirely straight forward: We only know the type of the map once we have resolved meta->map_ptr in check_func_arg. This means we can't swap out the prototype in check_helper_call until we're half way through the function. Instead, modify check_func_arg to treat ARG_PTR_TO_MAP_VALUE to mean "the native type for the map" instead of "pointer to memory" for sockmap and sockhash. This means we don't have to modify the function prototype at all Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200821102948.21918-5-lmb@cloudflare.com
2020-08-21bpf: sockmap: Call sock_map_update_elem directlyLorenz Bauer
Don't go via map->ops to call sock_map_update_elem, since we know what function to call in bpf_map_update_value. Since we currently don't allow calling map_update_elem from BPF context, we can remove ops->map_update_elem and rename the function to sock_map_update_elem_sys. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200821102948.21918-4-lmb@cloudflare.com
2020-08-21Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "11 patches. Subsystems affected by this: misc, mm/hugetlb, mm/vmalloc, mm/misc, romfs, relay, uprobes, squashfs, mm/cma, mm/pagealloc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, page_alloc: fix core hung in free_pcppages_bulk() mm: include CMA pages in lowmem_reserve at boot squashfs: avoid bio_alloc() failure with 1Mbyte blocks uprobes: __replace_page() avoid BUG in munlock_vma_page() kernel/relay.c: fix memleak on destroy relay channel romfs: fix uninitialized memory leak in romfs_dev_read() mm/rodata_test.c: fix missing function declaration mm/vunmap: add cond_resched() in vunmap_pmd_range khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter() hugetlb_cgroup: convert comma to semicolon mailmap: add Andi Kleen
2020-08-21bpf: Implement link_query callbacks in map element iteratorsYonghong Song
For bpf_map_elem and bpf_sk_local_storage bpf iterators, additional map_id should be shown for fdinfo and userspace query. For example, the following is for a bpf_map_elem iterator. $ cat /proc/1753/fdinfo/9 pos: 0 flags: 02000000 mnt_id: 14 link_type: iter link_id: 34 prog_tag: 104be6d3fe45e6aa prog_id: 173 target_name: bpf_map_elem map_id: 127 Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200821184419.574240-1-yhs@fb.com
2020-08-21bpf: Implement link_query for bpf iteratorsYonghong Song
This patch implemented bpf_link callback functions show_fdinfo and fill_link_info to support link_query interface. The general interface for show_fdinfo and fill_link_info will print/fill the target_name. Each targets can register show_fdinfo and fill_link_info callbacks to print/fill more target specific information. For example, the below is a fdinfo result for a bpf task iterator. $ cat /proc/1749/fdinfo/7 pos: 0 flags: 02000000 mnt_id: 14 link_type: iter link_id: 11 prog_tag: 990e1f8152f7e54f prog_id: 59 target_name: task Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200821184418.574122-1-yhs@fb.com
2020-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-08-21 The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 5 day(s) which contain a total of 12 files changed, 78 insertions(+), 24 deletions(-). The main changes are: 1) three fixes in BPF task iterator logic, from Yonghong. 2) fix for compressed dwarf sections in vmlinux, from Jiri. 3) fix xdp attach regression, from Andrii. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-21uprobes: __replace_page() avoid BUG in munlock_vma_page()Hugh Dickins
syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when called from uprobes __replace_page(). Which of many ways to fix it? Settled on not calling when PageCompound (since Head and Tail are equals in this context, PageCompound the usual check in uprobes.c, and the prior use of FOLL_SPLIT_PMD will have cleared PageMlocked already). Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008161338360.20413@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-21kernel/relay.c: fix memleak on destroy relay channelWei Yongjun
kmemleak report memory leak as follows: unreferenced object 0x607ee4e5f948 (size 8): comm "syz-executor.1", pid 2098, jiffies 4295031601 (age 288.468s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: relay_open kernel/relay.c:583 [inline] relay_open+0xb6/0x970 kernel/relay.c:563 do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557 __blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597 blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738 blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613 block_ioctl+0xe5/0x120 fs/block_dev.c:1871 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 'chan->buf' is malloced in relay_open() by alloc_percpu() but not free while destroy the relay channel. Fix it by adding free_percpu() before return from relay_destroy_channel(). Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Akash Goel <akash.goel@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>