path: root/kernel
2020-05-19  kprobes: Prevent probes in .noinstr.text section  (Thomas Gleixner)
Instrumentation is forbidden in the .noinstr.text section. Make kprobes respect this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.179862032@linutronix.de
2020-05-19  rcu: Provide __rcu_is_watching()  (Thomas Gleixner)
Same as rcu_is_watching() but without the preempt_disable/enable() pair inside the function. It is marked noinstr so it ends up in the non-instrumentable text section. This is useful for non-preemptible code, especially in the low level entry section. Using rcu_is_watching() there results in a call to the preempt_schedule_notrace() thunk which triggers noinstr section warnings in objtool. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de
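The relationship between the two variants, as a minimal sketch (the real definitions live in kernel/rcu/tree.c; rcu_dynticks_curr_cpu_in_eqs() is RCU's existing internal predicate):

  /* Instrumentation-free variant: no preempt_disable/enable() pair. */
  noinstr bool __rcu_is_watching(void)
  {
          return !rcu_dynticks_curr_cpu_in_eqs();
  }

  /* Regular variant, safe to call from preemptible context. */
  bool rcu_is_watching(void)
  {
          bool ret;

          preempt_disable_notrace();
          ret = !rcu_dynticks_curr_cpu_in_eqs();
          preempt_enable_notrace();
          return ret;
  }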
2020-05-19  rcu: Provide rcu_irq_exit_preempt()  (Thomas Gleixner)
Interrupts and exceptions invoke rcu_irq_enter() on entry and need to invoke rcu_irq_exit() before they either return to the interrupted code or invoke the scheduler due to preemption. The general assumption is that RCU idle code has to have preemption disabled so that a return from interrupt cannot schedule. So the return from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq(). If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code had preemption enabled then this goes unnoticed until the CPU goes idle or some other RCU check is executed. Provide rcu_irq_exit_preempt() which can be invoked from the interrupt/exception return code in case that preemption is enabled. It invokes rcu_irq_exit() and contains a few sanity checks in case that CONFIG_PROVE_RCU is enabled to catch such issues directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de
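Schematically, and hedging on the exact set of RCU_LOCKDEP_WARN conditions in the patch, the new helper has this shape:

  void rcu_irq_exit_preempt(void)
  {
          lockdep_assert_irqs_disabled();
          rcu_irq_exit();

          /* Sanity checks, active only with CONFIG_PROVE_RCU: if RCU is
           * in an extended quiescent state here, a preemptible return
           * path could schedule without RCU watching.
           */
          RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(),
                           "RCU in extended quiescent state!");
  }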
2020-05-19  rcu: Make RCU IRQ enter/exit functions rely on in_nmi()  (Paul E. McKenney)
The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an "irq" parameter that indicates whether these functions have been invoked from an irq handler (irq==true) or an NMI handler (irq==false). However, recent changes have applied notrace to a few critical functions such that rcu_nmi_enter_common() and rcu_nmi_exit_common() may now rely on in_nmi(). Note that in_nmi() works no differently than before; rather, tracing is now prohibited in code regions where in_nmi() would incorrectly report NMI state. Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(), respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de
2020-05-19  rcu/tree: Mark the idle relevant functions noinstr  (Thomas Gleixner)
These functions are invoked from context tracking and other places in the low level entry code. Move them into the .noinstr.text section to exclude them from instrumentation. Mark the places which are safe to invoke traceable functions with instrumentation_begin/end() so objtool won't complain. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de
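The pattern, schematically (function body details elided):

  noinstr void rcu_idle_enter(void)
  {
          /* non-instrumentable entry work ... */

          instrumentation_begin();
          /* Tracepoints, lockdep and other traceable helpers are
           * permitted between these markers; objtool verifies that
           * nothing traceable leaks outside them.
           */
          instrumentation_end();

          /* ... non-instrumentable tail */
  }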
2020-05-19  sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception  (Peter Zijlstra)
SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(), so remove it from the generic code and move it into the SuperH code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
2020-05-19  lockdep: Always inline lockdep_{off,on}()  (Peter Zijlstra)
These functions are called {early,late} in nmi_{enter,exit} and should not be traced or probed. They are also puny, so 'inline' them. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de
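A sketch of the resulting shape; the functions only bump a recursion counter on current, so inlining them costs nothing:

  static __always_inline void lockdep_off(void)
  {
          current->lockdep_recursion++;
  }

  static __always_inline void lockdep_on(void)
  {
          current->lockdep_recursion--;
  }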
2020-05-19  printk: Disallow instrumenting print_nmi_enter()  (Peter Zijlstra)
It happens early in nmi_enter(), no tracing, probing or other funnies allowed. Specifically as nmi_enter() will be used in do_debug(), which would cause recursive exceptions when kprobed. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.139720912@linutronix.de
2020-05-19  printk: Prepare for nested printk_nmi_enter()  (Petr Mladek)
There is plenty of space in the printk_context variable. Reserve one byte there for the NMI context to be on the safe side. It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter() will trigger much earlier. Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de
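Schematically (the offset constant is illustrative of "one reserved byte", not quoted from the patch):

  /* A whole byte of printk_context is reserved for NMI nesting. */
  #define PRINTK_NMI_CONTEXT_OFFSET  0x010000000

  void notrace printk_nmi_enter(void)
  {
          this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
  }

  void notrace printk_nmi_exit(void)
  {
          this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
  }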
2020-05-19  Merge tag 'noinstr-lds-2020-05-19' into core/rcu  (Thomas Gleixner)
Get the noinstr section and annotation markers to base the RCU parts on.
2020-05-19  lockdep: Prepare for noinstr sections  (Peter Zijlstra)
Force inlining and prevent instrumentation of all sorts by marking the functions which are invoked from low level entry code with 'noinstr'. Split the irqflags tracking into two parts. One which does the heavy lifting while RCU is watching and the final one which can be invoked after RCU is turned off. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
2020-05-19  tracing: Provide lockdep less trace_hardirqs_on/off() variants  (Thomas Gleixner)
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer core itself is safe, but the resulting tracepoints can be utilized by e.g. BPF which is unsafe. Provide variants which do not contain the lockdep invocation so the lockdep and tracer invocations can be split at the call site and placed properly. This is required because lockdep needs to be aware of the state before switching away from RCU idle and after switching to RCU idle because these transitions can take locks. As these code paths are going to be non-instrumentable the tracer can be invoked after RCU is turned on and before the switch to RCU idle. So for these new variants there is no need to invoke the rcuidle aware tracer functions. Name them so they match the lockdep counterparts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
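At an entry-code call site the resulting split looks roughly like this; ordering is the point here, so treat the exact sequence as a sketch of the series rather than verbatim kernel text:

  /* interrupt exit path, while RCU is still watching: */
  instrumentation_begin();
  trace_hardirqs_on_prepare();               /* tracer part */
  lockdep_hardirqs_on_prepare(CALLER_ADDR0); /* lockdep, first half */
  instrumentation_end();

  rcu_irq_exit();                            /* switch to RCU idle */
  lockdep_hardirqs_on(CALLER_ADDR0);         /* lockdep, final half */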
2020-05-19  proc: proc_pid_ns takes super_block as an argument  (Alexey Gladkov)
syzbot found that touch /proc/testfile causes a NULL pointer dereference at tomoyo_get_local_path() because the inode of the dentry is NULL. Before c59f415a7cb6, Tomoyo received pid_ns from proc's s_fs_info directly. Since proc_pid_ns() can only work with an inode, using it in tomoyo_get_local_path() was wrong. To avoid creating more functions for getting proc_ns, change the argument type of the proc_pid_ns() function. Then, Tomoyo can use the existing super_block to get pid_ns. Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock") Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
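The signature change, sketched; the s_fs_info accessor name is illustrative of the proc headers of that era, not quoted from the patch:

  /* before: the namespace had to be derived from an inode */
  struct pid_namespace *proc_pid_ns(const struct inode *inode);

  /* after: callers holding only a super_block need no inode */
  static inline struct pid_namespace *proc_pid_ns(const struct super_block *sb)
  {
          return proc_sb_info(sb)->pid_ns;
  }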
2020-05-19  ARM: 8976/1: module: allow arch overrides for .init section names  (Vincent Whitchurch)
ARM stores unwind information for .init.text in sections named .ARM.extab.init.text and .ARM.exidx.init.text. Since those aren't currently recognized as init sections, they're allocated along with the core section, and relocation fails if the core and the init section are allocated from different regions and can't reach each other.

  final section addresses:
      ...
      0x7f800000 .init.text
      ...
      0xcbb54078 .ARM.exidx.init.text
      ...

  section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 -> 0x7f800000)

Allow architectures to override the section name so that ARM can fix this. Acked-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-18  kgdb: Add kgdb_has_hit_break function  (Vincent Chen)
The break instruction in RISC-V does not have an immediate value field, so the kernel cannot identify the purpose of each trap exception through the opcode. This makes the existing identification schemes of other architectures unsuitable for the RISC-V kernel. To solve this problem, this patch adds kgdb_has_hit_break(), which can help the RISC-V kernel identify the KGDB trap exception. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
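The helper's contract, with an illustrative RISC-V style caller (a sketch, not the verbatim patch):

  /* kernel/debug/debug_core.c, now available to arch code: */
  int kgdb_has_hit_break(unsigned long addr);

  /* in the trap handler: */
  if (kgdb_has_hit_break(regs->epc))
          /* a breakpoint kgdb planted itself: enter the debugger */
          return kgdb_handle_exception(1, SIGTRAP, 0, regs);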
2020-05-18  kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles  (Douglas Anderson)
We want to enable kgdb to debug the early parts of the kernel. Unfortunately kgdb is normally a client of the tty API in the kernel, and serial drivers don't register to the tty layer until fairly late in the boot process.

Serial drivers do, however, commonly register a boot console. Let's enable the kgdboc driver to work with boot consoles to provide early debugging.

This change co-opts the existing read() function pointer that's part of "struct console". It's assumed that if a boot console (with the flag CON_BOOT) has implemented read() then both the read() and write() functions are polling functions. That means they work without interrupts and read() will return immediately (with 0 bytes read) if there's nothing to read. This should be a safe assumption since no current boot consoles appear to implement read(), and there seems to be no reason to do so unless they want to support "kgdboc_earlycon".

The normal/expected way to make all this work is to use "kgdboc_earlycon" and "kgdboc" together. You should point them both to the same physical serial connection. At boot time, as the system transitions from the boot console to the normal console (and registers a tty), kgdb will switch over.

One awkward part of all this, though, is that there can be a window where the boot console goes away and we can't quite transition over to the main kgdboc that uses the tty layer. There are two main problems:

1. The act of registering the tty doesn't cause any call into kgdboc, so there is a window of time when the tty is there but kgdboc's init code hasn't been called, so we can't transition to it.

2. On some serial drivers the normal console inits (and replaces the boot console) quite early in the system. Presumably these drivers were coded up before earlycon worked as well as it does today and probably don't need to do this anymore, but it causes us problems nonetheless.

Problem #1 is not too big of a deal, somewhat due to the luck of probe ordering: kgdboc is last in the tty/serial/Makefile so its probe runs right after all other tty devices. It's not fun to rely on this, but it does work for the most part.

Problem #2 is a big deal, but only for some serial drivers. Other serial drivers end up registering the console (which gets rid of the boot console) and tty at nearly the same time.

The way we'll deal with the window between when the system has stopped using the boot console and when we're set up using the tty is to keep using the boot console. This may sound surprising, but it has been found to work well in practice. If it doesn't work, it shouldn't be too hard for a given serial driver to make it keep working. Specifically, it's expected that the read()/write() functions provided by the boot console should be the same (or nearly the same) as the normal kgdb polling functions, so continuing to use them should work just fine. To make things even more likely to work, we'll also trap the recently added exit() function in the boot console we're using and delay any calls to it until we're all done with the boot console.

NOTE: there could be ways to use all this in weird/unexpected ways. If you do something like this, it's a bit of a buyer-beware situation. Specifically:

- If you specify only "kgdboc_earlycon" but not "kgdboc" then (depending on your serial driver) things will probably work OK, but you'll get a warning printed the first time you use kgdb after the boot console is gone. You'd only be able to do this, of course, if the serial driver you're running atop provided an early boot console.

- If your "kgdboc_earlycon" and "kgdboc" devices are not the same device, things should work OK, but it'll be your job to switch over which device you're monitoring (including figuring out how to switch over gdb in-flight if you're using it).

When trying to enable "kgdboc_earlycon" it should be noted that the names registered through the boot console layer and the tty layer are not the same for the same port. For example, when debugging on one board I'd need to pass "kgdboc_earlycon=qcom_geni kgdboc=ttyMSM0" to enable things properly. Since digging up the boot console name is a pain and there will rarely be more than one boot console enabled, you can provide the "kgdboc_earlycon" parameter without specifying the name of the boot console. In this case we'll just pick the first boot console that implements read().

This new "kgdboc_earlycon" parameter should be contrasted with the existing "ekgdboc" parameter. While both provide a way to debug very early, the usage and mechanisms are quite different. Specifically, "kgdboc_earlycon" is meant to be used in tandem with "kgdboc" and there is a transition from one to the other. The "ekgdboc" parameter, on the other hand, replaces the "kgdboc" parameter. It runs the same logic as the "kgdboc" parameter but just relies on your TTY driver being present super early. The only known usage of the old "ekgdboc" parameter is documented as "ekgdboc=kbd earlyprintk=vga". It should be noted that "kbd" has special treatment allowing it to init early as a tty device. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.8.I8fba5961bf452ab92350654aa61957f23ecf0100@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18  kgdb: Prevent infinite recursive entries to the debugger  (Douglas Anderson)
If we detect that we recursively entered the debugger we should hack our I/O ops to NULL so that the panic() in the next line won't actually cause another recursion into the debugger. The first line of kgdb_panic() will check this and return. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.6.I89de39f68736c9de610e6f241e68d8dbc44bc266@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18  kgdb: Delay "kgdbwait" to dbg_late_init() by default  (Douglas Anderson)
Using kgdb requires at least some level of architecture-level initialization. If nothing else, it relies on the architecture to pass breakpoints/crashes onto kgdb. On some architectures this all works super early, specifically it starts working at some point in time before Linux parses early_param's. On other architectures it doesn't. A survey of a few platforms:

a) x86: Presumably it all works early since "ekgdboc" is documented to work here.

b) arm64: Catching crashes works; with a simple patch breakpoints can also be made to work.

c) arm: Nothing in kgdb works until paging_init() -> devicemaps_init() -> early_trap_init()

Let's be conservative and, by default, process "kgdbwait" (which tells the kernel to drop into the debugger ASAP at boot) a bit later, at dbg_late_init() time. If an architecture has tested it and wants to re-enable super early debugging, it can select the ARCH_HAS_EARLY_DEBUG Kconfig option. We'll do this for x86 to start. It should be noted that dbg_late_init() is still called quite early in the system.

Note that this patch doesn't affect when kgdb runs its init. If kgdb is set to initialize early it will still initialize when parsing early_param's. This patch _only_ inhibits the initial breakpoint from "kgdbwait". This means:

* Without any extra patches arm64 platforms will at least catch crashes after kgdb inits.

* arm platforms will catch crashes (and could handle a hardcoded kgdb_breakpoint()) any time after early_trap_init() runs, even before dbg_late_init().

Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20200507130644.v4.4.I3113aea1b08d8ce36dc3720209392ae8b815201b@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18  scs: Remove references to asm/scs.h from core code  (Will Deacon)
asm/scs.h is no longer needed by the core code, so remove a redundant header inclusion and update the stale Kconfig text. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18  scs: Move scs_overflow_check() out of architecture code  (Will Deacon)
There is nothing architecture-specific about scs_overflow_check() as it's just a trivial wrapper around scs_corrupted(). For parity with task_stack_end_corrupted(), rename scs_corrupted() to task_scs_end_corrupted() and call it from schedule_debug() when CONFIG_SCHED_STACK_END_CHECK is enabled, which better reflects its purpose as a debug feature to catch inadvertent overflow of the SCS. Finally, remove the unused scs_overflow_check() function entirely. This has absolutely no impact on architectures that do not support SCS (currently arm64 only). Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
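The resulting hook in the scheduler, roughly:

  static inline void schedule_debug(struct task_struct *prev, bool preempt)
  {
  #ifdef CONFIG_SCHED_STACK_END_CHECK
          if (task_stack_end_corrupted(prev))
                  panic("corrupted stack end detected inside scheduler\n");

          if (task_scs_end_corrupted(prev))
                  panic("corrupted shadow stack detected inside scheduler\n");
  #endif
          /* ... */
  }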
2020-05-18  scs: Move accounting into alloc/free functions  (Will Deacon)
There's no need to perform the shadow stack page accounting independently of the lifetime of the underlying allocation, so call the accounting code from the {alloc,free}() functions and simplify the code in the process. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18  arm64: scs: Store absolute SCS stack pointer value in thread_info  (Will Deacon)
Storing the SCS information in thread_info as a {base,offset} pair introduces an additional load instruction on the ret-to-user path, since the SCS stack pointer in x18 has to be converted back to an offset by subtracting the base. Replace the offset with the absolute SCS stack pointer value instead and avoid the redundant load. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18  kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb  (Douglas Anderson)
In commit 81eaadcae81b ("kgdboc: disable the console lock when in kgdb") we avoided the WARN_CONSOLE_UNLOCKED() yell when we were in kgdboc. That still works fine, but it turns out that we get a similar yell when using other I/O drivers. One example is the "I/O driver" for the kgdb test suite (kgdbts). When I enabled that I again got the same yells. Even though "kgdbts" doesn't actually interact with the user over the console, using it still causes kgdb to print to the consoles. That trips the same warning:

  con_is_visible+0x60/0x68
  con_scroll+0x110/0x1b8
  lf+0x4c/0xc8
  vt_console_print+0x1b8/0x348
  vkdb_printf+0x320/0x89c
  kdb_printf+0x68/0x90
  kdb_main_loop+0x190/0x860
  kdb_stub+0x2cc/0x3ec
  kgdb_cpu_enter+0x268/0x744
  kgdb_handle_exception+0x1a4/0x200
  kgdb_compiled_brk_fn+0x34/0x44
  brk_handler+0x7c/0xb8
  do_debug_exception+0x1b4/0x228

Let's increment/decrement the "ignore_console_lock_warning" variable all the time when we enter the debugger. This will allow us to later revert commit 81eaadcae81b ("kgdboc: disable the console lock when in kgdb"). Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.1.Ied2b058357152ebcc8bf68edd6f20a11d98d7d4e@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
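A sketch of the change in kgdb_cpu_enter(); the counter is the atomic_t already declared in include/linux/console.h:

  /* entering the debugger: consoles will be poked without the lock */
  atomic_inc(&ignore_console_lock_warning);

  /* ... debugger session, kdb/kgdb print to the consoles ... */

  /* leaving the debugger: restore the normal warning behaviour */
  atomic_dec(&ignore_console_lock_warning);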
2020-05-19  powerpc/watchpoint: Don't allow concurrent perf and ptrace events  (Ravi Bangoria)
With Book3s DAWR, ptrace and perf watchpoints on powerpc behave differently. A ptrace watchpoint works in one-shot mode and generates a signal before executing the instruction. It's the ptrace user's job to single-step the instruction and re-enable the watchpoint. OTOH, in the case of a perf watchpoint, the kernel emulates/single-steps the instruction and then generates the event. If perf and ptrace create two events with the same or overlapping address ranges, it's ambiguous who should single-step the instruction. Because of this issue, don't allow perf and ptrace watchpoints at the same time if their address ranges overlap. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Michael Neuling <mikey@neuling.org> Link: https://lore.kernel.org/r/20200514111741.97993-15-ravi.bangoria@linux.ibm.com
2020-05-18  Revert "docs: sysctl/kernel: document ngroups_max"  (Jonathan Corbet)
This reverts commit 2f4c33063ad713e3a5b63002cf8362846e78bd71. The changes here were fine, but there's a non-documentation change to sysctl.c that makes messes elsewhere; those changes should have been done independently. Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-05-18  genirq/irq_sim: Simplify the API  (Bartosz Golaszewski)
The interrupt simulator API exposes a lot of custom data structures and functions and doesn't reuse the interfaces already exposed by the irq subsystem. This patch tries to address it. We hide all the simulator-related data structures from users and instead rely on the well-known irq domain. When creating the interrupt simulator the user receives a pointer to a newly created irq_domain and can use it to create mappings for simulated interrupts. It is also possible to pass a handle to fwnode when creating the simulator domain and retrieve it using irq_find_matching_fwnode(). The irq_sim_fire() function is dropped as well. Instead we implement the irq_get/set_irqchip_state interface. We modify the two modules that use the simulator at the same time as adding these changes in order to reduce the intermediate bloat that would result when trying to migrate the drivers in separate patches. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> #for IIO Link: https://lore.kernel.org/r/20200514083901.23445-3-brgl@bgdev.pl
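Usage after the rework, sketched (fwnode may be NULL if an irq_find_matching_fwnode() lookup isn't needed):

  struct irq_domain *sim_domain;
  int virq;

  /* create a simulator backing, e.g., 8 interrupt lines */
  sim_domain = irq_domain_create_sim(fwnode, 8);

  /* map a simulated line to a virq, like with any other domain */
  virq = irq_create_mapping(sim_domain, 0);

  /* "fire" it through the generic irqchip-state interface;
   * this replaces the removed irq_sim_fire():
   */
  irq_set_irqchip_state(virq, IRQCHIP_STATE_PENDING, true);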
2020-05-18  irqdomain: Make irq_domain_reset_irq_data() available to non-hierarchical users  (Bartosz Golaszewski)
irq_domain_reset_irq_data() doesn't modify the parent data, so it can be made available even if irq domain hierarchy is not being built. We'll subsequently use it in irq_sim code. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20200514083901.23445-2-brgl@bgdev.pl
2020-05-16  blktrace: Report pid with note messages  (Jan Kara)
Currently informational messages within block trace do not have PID information of the process reporting the message included. With BFQ it is sometimes useful to have the information and there's no good reason to omit the information from the trace. So just fill in pid information when generating note message. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-16  bpf: Fix check_return_code to only allow [0,1] in trace_iter progs  (Daniel Borkmann)
As per 15d83c4d7cef ("bpf: Allow loading of a bpf_iter program") we only allow a range of [0,1] for return codes. Therefore BPF_TRACE_ITER relies on the default tnum_range(0, 1) which is set in range var. On recent merge of net into net-next commit e92888c72fbd ("bpf: Enforce returning 0 for fentry/fexit progs") got pulled in and caused a merge conflict with the changes from 15d83c4d7cef. The resolution had a small hiccup in that it removed the [0,1] range restriction again so that BPF_TRACE_ITER would have no enforcement. Fix it by adding it back. Fixes: da07f52d3caf ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2020-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller)
Move the bpf verifier trace check into the new switch statement in HEAD. Resolve the overlapping changes in hinic, where bug fixes overlap the addition of VF support. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from David Miller:

1) Fix sk_psock reference count leak on receive, from Xiyu Yang.
2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.
3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this, from Maciej Żenczykowski.
4) ipv4 route redirect backoff wasn't actually enforced, from Paolo Abeni.
5) Fix netprio cgroup v2 leak, from Zefan Li.
6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.
7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.
8) Various bpf probe handling fixes, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
  selftests: mptcp: pm: rm the right tmp file
  dpaa2-eth: properly handle buffer size restrictions
  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
  bpf: Restrict bpf_probe_read{, str}() only to archs where they work
  MAINTAINERS: Mark networking drivers as Maintained.
  ipmr: Add lockdep expression to ipmr_for_each_table macro
  ipmr: Fix RCU list debugging warning
  drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
  net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
  tcp: fix error recovery in tcp_zerocopy_receive()
  MAINTAINERS: Add Jakub to networking drivers.
  MAINTAINERS: another add of Karsten Graul for S390 networking
  drivers: ipa: fix typos for ipa_smp2p structure doc
  pppoe: only process PADT targeted at local interfaces
  selftests/bpf: Enforce returning 0 for fentry/fexit programs
  bpf: Enforce returning 0 for fentry/fexit progs
  net: stmmac: fix num_por initialization
  security: Fix the default value of secid_to_secctx hook
  libbpf: Fix register naming in PT_REGS s390 macros
  ...
2020-05-15  kernel/cpu_pm: Fix uninitted local in cpu_pm  (Douglas Anderson)
cpu_pm_notify() is basically a wrapper of notifier_call_chain(). notifier_call_chain() doesn't initialize *nr_calls to 0 before it starts incrementing it; presumably it's up to the callers to do this. Unfortunately the callers of cpu_pm_notify() don't init *nr_calls. This potentially means you could get too many or too few calls to CPU_PM_ENTER_FAILED or CPU_CLUSTER_PM_ENTER_FAILED, depending on the luck of the stack. Let's fix this. Fixes: ab10023e0088 ("cpu_pm: Add cpu power management notifiers") Cc: stable@vger.kernel.org Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20200504104917.v6.3.I2d44fc0053d019f239527a4e5829416714b7e299@changeid Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
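The fix is the missing initializer at the call sites; an illustrative sketch of the cpu_pm_enter() shape:

  int cpu_pm_enter(void)
  {
          int nr_calls = 0;	/* was uninitialized before the fix */
          int ret;

          ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls);
          if (ret)
                  /* unwind only the callbacks that actually ran */
                  cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL);

          return notifier_to_errno(ret);
  }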
2020-05-15  docs: sysctl/kernel: document ngroups_max  (Stephen Kitt)
This is a read-only export of NGROUPS_MAX, so this patch also changes the declarations in kernel/sysctl.c to const. Signed-off-by: Stephen Kitt <steve@sk2.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200515160222.7994-1-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-05-15  scs: Add support for stack usage debugging  (Sami Tolvanen)
Implements CONFIG_DEBUG_STACK_USAGE for shadow stacks. When enabled, also prints out the highest shadow stack usage per process. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> [will: rewrote most of scs_check_usage()] Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15  scs: Add page accounting for shadow call stack allocations  (Sami Tolvanen)
This change adds accounting for the memory allocated for shadow stacks. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15  scs: Add support for Clang's Shadow Call Stack (SCS)  (Sami Tolvanen)
This change adds generic support for Clang's Shadow Call Stack, which uses a shadow stack to protect return addresses from being overwritten by an attacker. Details are available here: https://clang.llvm.org/docs/ShadowCallStack.html Note that security guarantees in the kernel differ from the ones documented for user space. The kernel must store addresses of shadow stacks in memory, which means an attacker capable of reading and writing arbitrary memory may be able to locate them and hijack control flow by modifying the stacks. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [will: Numerous cosmetic changes] Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15  bpf: Implement CAP_BPF  (Alexei Starovoitov)
Implement permissions as stated in uapi/linux/capability.h. In order to do that the verifier allow_ptr_leaks flag is split into four flags and they are set as:

  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three are currently equivalent to perfmon_capable(), since leaking kernel pointers and reading kernel memory via side channel attacks is roughly equivalent to reading kernel memory with cap_perfmon. 'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions, subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the verifier, run time mitigations in bpf array, and enables indirect variable access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code by the verifier. That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will have speculative checks done by the verifier and other spectre mitigations applied. Such a networking BPF program will not be able to leak kernel pointers and will not be able to access arbitrary kernel memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
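The helpers sit next to the existing *_capable() family; a sketch of two of them:

  static inline bool bpf_capable(void)
  {
          return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
  }

  /* the "first three" flags above reduce to this today: */
  static inline bool bpf_allow_ptr_leaks(void)
  {
          return perfmon_capable();
  }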
2020-05-15  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier  (Daniel Borkmann)
Usage of the plain %s conversion specifier in bpf_trace_printk() suffers from the very same issue as the bpf_probe_read{,str}() helpers, that is, it is broken on archs with overlapping address ranges. While the helpers have been addressed through work in 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need an option for bpf_trace_printk() as well to fix it. Similarly as with the helpers, force users to make an explicit choice by adding %pks and %pus specifiers to bpf_trace_printk() which will then pick the corresponding strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS. The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended for other objects aside strings that are probed and printed under tracing, and reused out of other facilities like bpf_seq_printf() or BTF based type printing. Existing behavior of %s for current users is still kept working for archs where it is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. For archs without this property we fall back to probing under KERNEL_DS as a sensible default. Fixes: 8d3b7dce8622 ("bpf: add support for %s specifier to bpf_trace_printk()") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net
2020-05-15  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range  (Daniel Borkmann)
Given bpf_probe_read{,str}() BPF helpers are now only available under CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range() as well to avoid hitting the same issue as in 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net
2020-05-15  bpf: Restrict bpf_probe_read{, str}() only to archs where they work  (Daniel Borkmann)
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs with overlapping address ranges, we should really take the next step to disable them from BPF use there. To generally fix the situation, we've recently added new helper variants bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str(). For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,kernel}_str helpers"). Given bpf_probe_read{,str}() have been around for ~5 years by now, there are plenty of users at least on x86 still relying on them today, so we cannot remove them entirely w/o breaking the BPF tracing ecosystem. However, their use should be restricted to archs with non-overlapping address ranges where they are working in their current form. Therefore, move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and have x86, arm64, arm select it (other archs supporting it can follow up as well). The remaining archs can work around this easily by relying on the feature probe from bpftool, which spills out defines that can be used from BPF C code to implement a drop-in replacement for old/new kernels via: bpftool feature probe macro Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
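A sketch of the BPF C side of that workaround; the HAVE_* guard name here is hypothetical, take the real one from the `bpftool feature probe macro` output on the target kernel:

  #ifdef HAVE_BPF_PROBE_READ_KERNEL_HELPER	/* hypothetical name */
  # define bpf_probe_read_any	bpf_probe_read_kernel
  #else
  # define bpf_probe_read_any	bpf_probe_read
  #endif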
2020-05-15  rcu: constify sysrq_key_op  (Emil Velikov)
With earlier commits, the API no longer discards the const-ness of the sysrq_key_op. As such we can add the notation. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jslaby@suse.com> Cc: linux-kernel@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: rcu@vger.kernel.org Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com> Link: https://lore.kernel.org/r/20200513214351.2138580-11-emil.l.velikov@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-15  kernel/power: constify sysrq_key_op  (Emil Velikov)
With earlier commits, the API no longer discards the const-ness of the sysrq_key_op. As such we can add the notation. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jslaby@suse.com> Cc: linux-kernel@vger.kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <len.brown@intel.com> Cc: linux-pm@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com> Link: https://lore.kernel.org/r/20200513214351.2138580-10-emil.l.velikov@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-15  kdb: constify sysrq_key_op  (Emil Velikov)
With earlier commits, the API no longer discards the const-ness of the sysrq_key_op. As such we can add the notation. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jslaby@suse.com> Cc: linux-kernel@vger.kernel.org Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: kgdb-bugreport@lists.sourceforge.net Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com> Link: https://lore.kernel.org/r/20200513214351.2138580-9-emil.l.velikov@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14  xdp: Cpumap redirect use frame_sz and increase skb_tailroom  (Jesper Dangaard Brouer)
Knowing the memory size backing the packet/xdp_frame data area, and knowing it already has reserved room for skb_shared_info, simplifies using build_skb significantly. With this change we no longer lie about the SKB truesize, but more importantly a significantly larger skb_tailroom is now provided, e.g. when drivers use a full PAGE_SIZE. This extra tailroom (in the linear area) can be used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce; see the TCP cases where tcp_queue_rcv() can 'eat' an skb). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945337822.97035.13557959180460986059.stgit@firesoul
2020-05-14  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller)
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.
2) support for narrow loads in bpf_sock_addr progs and additional helpers in cg-skb progs, from Andrey.
3) bpf benchmark runner, from Andrii.
4) arm and riscv JIT optimizations, from Luke.
5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14  bpf: Fix bpf_iter's task iterator logic  (Andrii Nakryiko)
task_seq_get_next might stop prematurely if get_pid_task() fails to get the task_struct. Such a failure doesn't mean that there are no more tasks with higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c) does a retry in that case. After this fix, instead of stopping prematurely after about 300 tasks on my server, the bpf_iter program now returns >4000, which sounds much closer to reality. Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200514055137.1564581-1-andriin@fb.com
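The retry pattern borrowed from next_tgid(), sketched:

  retry:
          pid = find_ge_pid(*tid, ns);
          if (pid) {
                  task = get_pid_task(pid, PIDTYPE_PID);
                  if (!task) {
                          ++*tid;		/* pid died: skip it, keep going */
                          goto retry;
                  }
          }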
2020-05-14  bpf: Enforce returning 0 for fentry/fexit progs  (Yonghong Song)
Currently, tracing/fentry and tracing/fexit prog return values are not enforced. In trampoline code, the fentry/fexit prog return values are ignored. Let us enforce them to be 0 to avoid confusion and to allow potential future extension. This patch also explicitly adds return value checking for tracing/raw_tp, tracing/fmod_ret, and freplace programs such that these program return values can be anything. The purpose is twofold:

1. to make return value expectations for these programs explicit in the verifier.

2. for the tracing prog type, if a future attach type is added, the default is -ENOTSUPP, which will enforce specifying return value ranges explicitly.

Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200514053206.1298415-1-yhs@fb.com
2020-05-14  bpf: Fix bug in mmap() implementation for BPF array map  (Andrii Nakryiko)
The mmap() subsystem allows a user-space application to memory-map a region with an initial page offset. This wasn't taken into account in the initial implementation of BPF array memory-mapping. This would result in the wrong pages, not taking the requested page shift into account, being memory-mapped into user-space. This patch fixes this gap and adds a test for such a scenario. Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com
2020-05-14  Merge tag 'for-linus-2020-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux  (Linus Torvalds)
Pull thread fix from Christian Brauner:
"This contains a single fix for all exported legacy fork helpers to block accidental access to clone3() features in the upper 32 bits of their respective flags arguments.

I got Cced on a glibc issue where someone reported consistent failures for the legacy clone() syscall on ppc64le when sign extension was performed (since the clone() syscall in glibc exposes the flags argument as an int whereas the kernel uses unsigned long).

The legacy clone() syscall is odd in a bunch of ways and here two things interact:

- First, legacy clone's flag argument is word-size dependent, i.e. it's an unsigned long whereas most system calls with flag arguments use int or unsigned int.

- Second, legacy clone() ignores unknown and deprecated flags.

The two of them taken together means that users on 64bit systems can pass garbage for the upper 32bit of the clone() syscall since forever and things would just work fine. The following program compiled on a 64bit kernel prior to v5.7-rc1 will succeed and will fail post v5.7-rc1 with EBADF:

  int main(int argc, char *argv[])
  {
          pid_t pid;

          /* Note that legacy clone() has different argument ordering on
           * different architectures so this won't work everywhere.
           *
           * Only set the upper 32 bits.
           */
          pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                        NULL, NULL, NULL, NULL);
          if (pid < 0)
                  exit(EXIT_FAILURE);
          if (pid == 0)
                  exit(EXIT_SUCCESS);
          if (wait(NULL) != pid)
                  exit(EXIT_FAILURE);

          exit(EXIT_SUCCESS);
  }

Since legacy clone() couldn't be extended this was not a problem so far and nobody really noticed or cared since nothing in the kernel ever bothered to look at the upper 32 bits.

But once we introduced clone3() and expanded the flag argument in struct clone_args to 64 bit we opened this can of worms. With the first flag-based extension to clone3() making use of the upper 32 bits of the flag argument we've effectively made it possible for the legacy clone() syscall to reach clone3() only flags on accident. The sign extension scenario is just the odd corner-case that we needed to figure this out.

The reason we just realized this now and not already when we introduced CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup file descriptor has been given - whereas CLONE_CLEAR_SIGHAND doesn't need to verify anything. It just silently resets the signal handlers to SIG_DFL. So the sign extension (or the user accidently passing garbage for the upper 32 bits) caused the CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it didn't find a valid cgroup file descriptor.

Note, I'm also capping kernel_thread()'s flag argument mainly because none of the new features make sense for kernel_thread() and we shouldn't risk them being accidently activated however unlikely. If we wanted to, we could even make kernel_thread() yell when an unknown flag has been set which it doesn't do right now. But it's not worth risking this in a bugfix imho"

* tag 'for-linus-2020-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fork: prevent accidental access to clone3 features
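The fix itself, sketched for the clone() wrapper (the other legacy helpers and kernel_thread() get the same masking):

  /* only the lower 32 bits of legacy clone()'s flags are meaningful */
  struct kernel_clone_args args = {
          .flags		= (lower_32_bits(clone_flags) & ~CSIGNAL),
          .exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
          /* ... stack, tls, parent_tid, child_tid ... */
  };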
2020-05-14  Merge tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull more tracing fixes from Steven Rostedt:
"Various tracing fixes:

- Fix a crash when having function tracing and function stack tracing on the command line. The ftrace trampolines are created as executable and read only. But the stack tracer tries to modify them with text_poke() which expects all kernel text to still be writable at boot. Keep the trampolines writable at boot, and convert them to read-only with the rest of the kernel.

- A selftest was triggering in the ring buffer iterator code, that is no longer valid with the update of keeping the ring buffer writable while an iterator is reading. Just bail after three failed attempts to get an event and remove the warning and disabling of the ring buffer.

- While modifying the ring buffer code, decided to remove all the unnecessary BUG() calls"

* tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Remove all BUG() calls
  ring-buffer: Don't deactivate the ring buffer on failed iterator reads
  x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up