summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-09-15printk: reimplement log_cont using record extensionJohn Ogness
Use the record extending feature of the ringbuffer to implement continuous messages. This preserves the existing continuous message behavior. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914123354.832-7-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: add finalization/extension supportJohn Ogness
Add support for extending the newest data block. For this, introduce a new finalization state (desc_finalized) denoting a committed descriptor that cannot be extended. Until a record is finalized, a writer can reopen that record to append new data. Reopening a record means transitioning from the desc_committed state back to the desc_reserved state. A writer can explicitly finalize a record if there is no intention of extending it. Also, records are automatically finalized when a new record is reserved. This relieves writers of needing to explicitly finalize while also making such records available to readers sooner. (Readers can only traverse finalized records.) Four new memory barrier pairs are introduced. Two of them are insignificant additions (data_realloc:A/desc_read:D and data_realloc:A/data_push_tail:B) because they are alternate path memory barriers that exactly match the purpose, pairing, and context of the two existing memory barrier pairs they provide an alternate path for. The other two new memory barrier pairs are significant additions: desc_reopen_last:A / _prb_commit:B - When reopening a descriptor, ensure the state transitions back to desc_reserved before fully trusting the descriptor data. _prb_commit:B / desc_reserve:D - When committing a descriptor, ensure the state transitions to desc_committed before checking the head ID to see if the descriptor needs to be finalized. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914123354.832-6-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: change representation of statesJohn Ogness
Rather than deriving the state by evaluating bits within the flags area of the state variable, assign the states explicit values and set those values in the flags area. Introduce macros to make it simple to read and write state values for the state variable. Although the functionality is preserved, the binary representation for the states is changed. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914123354.832-5-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: clear initial reserved fieldsJohn Ogness
prb_reserve() will set some meta data values and leave others uninitialized (or rather, containing the values of the previous wrap). Simplify the API by always clearing out all the fields. Only the sequence number is filled in. The caller is now responsible for filling in the rest of the meta data fields. In particular, for correctly filling in text and dict lengths. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914123354.832-4-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: add BLK_DATALESS() macroJohn Ogness
Rather than continually needing to explicitly check @begin and @next to identify a dataless block, introduce and use a BLK_DATALESS() macro. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914123354.832-3-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: relocate get_data()John Ogness
Move the internal get_data() function as-is above prb_reserve() so that a later change can make use of the static function. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914123354.832-2-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: avoid memcpy() on state_varJohn Ogness
@state_var is copied as part of the descriptor copying via memcpy(). This is not allowed because @state_var is an atomic type, which in some implementations may contain a spinlock. Avoid using memcpy() with @state_var by explicitly copying the other fields of the descriptor. @state_var is set using atomic set operator before returning. Fixes: b6cf8b3f3312 ("printk: add lockless ringbuffer") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914094803.27365-2-john.ogness@linutronix.de
2020-09-15printk: ringbuffer: fix setting state in desc_read()John Ogness
It is expected that desc_read() will always set at least the @state_var field. However, if the descriptor is in an inconsistent state, no fields are set. Also, the second load of @state_var is not stored in @desc_out and so might not match the state value that is returned. Always set the last loaded @state_var into @desc_out, regardless of the descriptor consistency. Fixes: b6cf8b3f3312 ("printk: add lockless ringbuffer") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200914094803.27365-1-john.ogness@linutronix.de
2020-09-14core/entry: Report syscall correctly for trace and auditKees Cook
On v5.8 when doing seccomp syscall rewrites (e.g. getpid into getppid as seen in the seccomp selftests), trace (and audit) correctly see the rewritten syscall on entry and exit: seccomp_bpf-1307 [000] .... 22974.874393: sys_enter: NR 110 (... seccomp_bpf-1307 [000] .N.. 22974.874401: sys_exit: NR 110 = 1304 With mainline we see a mismatched enter and exit (the original syscall is incorrectly visible on entry): seccomp_bpf-1030 [000] .... 21.806766: sys_enter: NR 39 (... seccomp_bpf-1030 [000] .... 21.806767: sys_exit: NR 110 = 1027 When ptrace or seccomp change the syscall, this needs to be visible to trace and audit at that time as well. Update the syscall earlier so they see the correct value. Fixes: d88d59b64ca3 ("core/entry: Respect syscall number rewrites") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200912005826.586171-1-keescook@chromium.org
2020-09-14Merge v5.9-rc5 into drm-nextDaniel Vetter
Paul needs 1a21e5b930e8 ("drm/ingenic: Fix leak of device_node pointer") and 3b5b005ef7d9 ("drm/ingenic: Fix driver not probing when IPU port is missing") from -fixes to be able to merge further ingenic patches into -next. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2020-09-14kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()Masami Hiramatsu
Commit: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") fixed one bug but the underlying bugs are not completely fixed yet. If we run a kprobe_module.tc of ftracetest, a warning triggers: # ./ftracetest test.d/kprobe/kprobe_module.tc === Ftrace unit tests === [1] Kprobe dynamic event - probing module ... ------------[ cut here ]------------ Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2) WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0 This is because the kill_kprobe() calls disarm_kprobe_ftrace() even if the given probe is not enabled. In that case, ftrace_set_filter_ip() fails because the given probe point is not registered to ftrace. Fix to check the given (going) probe is enabled before invoking disarm_kprobe_ftrace(). Fixes: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2
2020-09-14lockdep: fix order in trace_hardirqs_off_caller()Sven Schnelle
Switch order so that locking state is consistent even if the IRQ tracer calls into lockdep again. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-09-13genirq: Allow interrupts to be excluded from /proc/interruptsMarc Zyngier
A number of architectures implement IPI statistics directly, duplicating the core kstat_irqs accounting. As we move IPIs to being actual IRQs, we would end-up with a confusing display in /proc/interrupts (where the IPIs would appear twice). In order to solve this, allow interrupts to be flagged as "hidden", which excludes them from /proc/interrupts. Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-09-13genirq: Add fasteoi IPI flowMarc Zyngier
For irqchips using the fasteoi flow, IPIs are a bit special. They need to be EOI'd early (before calling the handler), as funny things may happen in the handler (they do not necessarily behave like a normal interrupt). Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-09-12Merge tag 'seccomp-v5.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp fixes from Kees Cook: "This fixes a rare race condition in seccomp when using TSYNC and USER_NOTIF together where a memory allocation would not get freed (found by syzkaller, fixed by Tycho). Additionally updates Tycho's MAINTAINERS and .mailmap entries for his new address" * tag 'seccomp-v5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: don't leave dangling ->notif if file allocation fails mailmap, MAINTAINERS: move to tycho.pizza seccomp: don't leak memory when filter install races
2020-09-11gcov: add support for GCC 10.1Peter Oberparleiter
Using gcov to collect coverage data for kernels compiled with GCC 10.1 causes random malfunctions and kernel crashes. This is the result of a changed GCOV_COUNTERS value in GCC 10.1 that causes a mismatch between the layout of the gcov_info structure created by GCC profiling code and the related structure used by the kernel. Fix this by updating the in-kernel GCOV_COUNTERS value. Also re-enable config GCOV_KERNEL for use with GCC 10. Reported-by: Colin Ian King <colin.king@canonical.com> Reported-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Tested-and-Acked-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-11kernel/debug: Fix spelling mistake in debug_core.cYouling Tang
Fix typo: "notifiter" --> "notifier" "overriden" --> "overridden" Signed-off-by: Youling Tang <tangyouling@loongson.cn> Link: https://lore.kernel.org/r/1596793480-22559-1-git-send-email-tangyouling@loongson.cn Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-09-11dma-mapping: move the dma_declare_coherent_memory documentationChristoph Hellwig
dma_declare_coherent_memory should not be in a DMA API guide aimed at driver writers (that is consumers of the API). Move it to a comment near the function instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-mapping: move dma_common_{mmap,get_sgtable} out of mapping.cChristoph Hellwig
Add a new file that contains helpers for misc DMA ops, which is only built when CONFIG_DMA_OPS is set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-direct: rename and cleanup __phys_to_dmaChristoph Hellwig
The __phys_to_dma vs phys_to_dma distinction isn't exactly obvious. Try to improve the situation by renaming __phys_to_dma to phys_to_dma_unencryped, and not forcing architectures that want to override phys_to_dma to actually provide __phys_to_dma. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-direct: remove __dma_to_physChristoph Hellwig
There is no harm in just always clearing the SME encryption bit, while significantly simplifying the interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-direct: use phys_to_dma_direct in dma_direct_allocChristoph Hellwig
Replace the currently open code copy. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-direct: lift gfp_t manipulation out of__dma_direct_alloc_pagesChristoph Hellwig
Move the detailed gfp_t setup from __dma_direct_alloc_pages into the caller to clean things up a little. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-direct: remove dma_direct_{alloc,free}_pagesChristoph Hellwig
Just merge these helpers into the main dma_direct_{alloc,free} routines, as the additional checks are always false for the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-mapping: add (back) arch_dma_mark_clean for ia64Christoph Hellwig
Add back a hook to optimize dcache flushing after reading executable code using DMA. This gets ia64 out of the business of pretending to be dma incoherent just for this optimization. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-11dma-mapping: fix DMA_OPS dependenciesChristoph Hellwig
Driver that select DMA_OPS need to depend on HAS_DMA support to work. The vop driver was missing that dependency, so add it, and also add a another depends in DMA_OPS itself. That won't fix the issue due to how the Kconfig dependencies work, but at least produce a warning about unmet dependencies. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-debug: remove most exportsChristoph Hellwig
Now that the main dma mapping entry points are out of line most of the symbols in dma-debug.c can only be called from built-in code. Remove the unused exports. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-11dma-mapping: remove the dma_dummy_ops exportChristoph Hellwig
dma_dummy_ops is only used by the ACPI code, which can't be modular. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-09-10objtool: Rename frame.h -> objtool.hJulien Thierry
Header frame.h is getting more code annotations to help objtool analyze object files. Rename the file to objtool.h. [ jpoimboe: add objtool.h to MAINTAINERS ] Signed-off-by: Julien Thierry <jthierry@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2020-09-10swiotlb: Mark max_segment with static keywordAndy Shevchenko
Sparse is not happy about max_segment declaration: CHECK kernel/dma/swiotlb.c kernel/dma/swiotlb.c:96:14: warning: symbol 'max_segment' was not declared. Should it be static? Mark it static as suggested. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-09-10swiotlb: Use %pa to print phys_addr_t variablesAndy Shevchenko
There is an extension to a %p to print phys_addr_t type of variables. Use it here. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-09-10perf/core: Pull pmu::sched_task() into perf_event_context_sched_out()Kan Liang
The pmu::sched_task() is a context switch callback. It passes the cpuctx->task_ctx as a parameter to the lower code. To find the cpuctx->task_ctx, the current code iterates a cpuctx list. The same context will iterated in perf_event_context_sched_out() soon. Share the cpuctx->task_ctx can avoid the unnecessary iteration of the cpuctx list. The pmu::sched_task() is also required for the optimization case for equivalent contexts. The task_ctx_sched_out() will eventually disable and reenable the PMU when schedule out events. Add perf_pmu_disable() and perf_pmu_enable() around task_ctx_sched_out() don't break anything. Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock is for per-CPU context, which is not necessary for the per-task context schedule. No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and perf_pmu_sched_task() any more. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com
2020-09-10perf/core: Pull pmu::sched_task() into perf_event_context_sched_in()Kan Liang
The pmu::sched_task() is a context switch callback. It passes the cpuctx->task_ctx as a parameter to the lower code. To find the cpuctx->task_ctx, the current code iterates a cpuctx list. The same context was just iterated in perf_event_context_sched_in(), which is invoked right before the pmu::sched_task(). Reuse the cpuctx->task_ctx from perf_event_context_sched_in() can avoid the unnecessary iteration of the cpuctx list. Both pmu::sched_task and perf_event_context_sched_in() have to disable PMU. Pull the pmu::sched_task into perf_event_context_sched_in() can also save the overhead from the PMU disable and reenable. The new and old tasks may have equivalent contexts. The current code optimize this case by swapping the context, which avoids the scheduling. For this case, pmu::sched_task() is still required, e.g., restore the LBR content. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200821195754.20159-1-kan.liang@linux.intel.com
2020-09-10timekeeping: Use seqcount_latch_tAhmed S. Darwish
Latch sequence counters are a multiversion concurrency control mechanism where the seqcount_t counter even/odd value is used to switch between two data storage copies. This allows the seqcount_t read path to safely interrupt its write side critical section (e.g. from NMIs). Initially, latch sequence counters were implemented as a single write function, raw_write_seqcount_latch(), above plain seqcount_t. The read path was expected to use plain seqcount_t raw_read_seqcount(). A specialized read function was later added, raw_read_seqcount_latch(), and became the standardized way for latch read paths. Having unique read and write APIs meant that latch sequence counters are basically a data type of their own -- just inappropriately overloading plain seqcount_t. The seqcount_latch_t data type was thus introduced at seqlock.h. Use that new data type instead of seqcount_raw_spinlock_t. This ensures that only latch-safe APIs are to be used with the sequence counter. Note that the use of seqcount_raw_spinlock_t was not very useful in the first place. Only the "raw_" subset of seqcount_t APIs were used at timekeeping.c. This subset was created for contexts where lockdep cannot be used. seqcount_LOCKTYPE_t's raison d'être -- verifying that the seqcount_t writer serialization lock is held -- cannot thus be done. References: 0c3351d451ae ("seqlock: Use raw_ prefix instead of _no_lockdep") References: 55f3560df975 ("seqlock: Extend seqcount API with associated locks") Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200827114044.11173-6-a.darwish@linutronix.de
2020-09-10time/sched_clock: Use seqcount_latch_tAhmed S. Darwish
Latch sequence counters have unique read and write APIs, and thus seqcount_latch_t was recently introduced at seqlock.h. Use that new data type instead of plain seqcount_t. This adds the necessary type-safety and ensures only latching-safe seqcount APIs are to be used. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200827114044.11173-5-a.darwish@linutronix.de
2020-09-10time/sched_clock: Use raw_read_seqcount_latch() during suspendAhmed S. Darwish
sched_clock uses seqcount_t latching to switch between two storage places protected by the sequence counter. This allows it to have interruptible, NMI-safe, seqcount_t write side critical sections. Since 7fc26327b756 ("seqlock: Introduce raw_read_seqcount_latch()"), raw_read_seqcount_latch() became the standardized way for seqcount_t latch read paths. Due to the dependent load, it has one read memory barrier less than the currently used raw_read_seqcount() API. Use raw_read_seqcount_latch() for the suspend path. Commit aadd6e5caaac ("time/sched_clock: Use raw_read_seqcount_latch()") missed changing that instance of raw_read_seqcount(). References: 1809bfa44e10 ("timers, sched/clock: Avoid deadlock during read from NMI") Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200715092345.GA231464@debian-buster-darwi.lab.linutronix.de
2020-09-09Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "This fixes a regression in padata" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: padata: fix possible padata_works_lock deadlock
2020-09-09sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTLValentin Schneider
The last sd_flag_debug shuffle inadvertently moved its definition within an #ifdef CONFIG_SYSCTL region. While CONFIG_SYSCTL is indeed required to produce the sched domain ctl interface (which uses sd_flag_debug to output flag names), it isn't required to run any assertion on the sched_domain hierarchy itself. Move the definition of sd_flag_debug to a CONFIG_SCHED_DEBUG region of topology.c. Now at long last we have: - sd_flag_debug declared in include/linux/sched/topology.h iff CONFIG_SCHED_DEBUG=y - sd_flag_debug defined in kernel/sched/topology.c, conditioned by: - CONFIG_SCHED_DEBUG, with an explicit #ifdef block - CONFIG_SMP, as a requirement to compile topology.c With this change, all symbols pertaining to SD flag metadata (with the exception of __SD_FLAG_CNT) are now defined exclusively within topology.c Fixes: 8fca9494d4b4 ("sched/topology: Move sd_flag_debug out of linux/sched/topology.h") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200908184956.23369-1-valentin.schneider@arm.com
2020-09-08sysctl: Convert to iter interfacesMatthew Wilcox (Oracle)
Using the read_iter/write_iter interfaces allows for in-kernel users to set sysctls without using set_fs(). Also, the buffer is a string, so give it the real type of 'char *', not void *. [AV: Christoph's fixup folded in] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-08bpf: Permit map_ptr arithmetic with opcode add and offset 0Yonghong Song
Commit 41c48f3a98231 ("bpf: Support access to bpf map fields") added support to access map fields with CORE support. For example, struct bpf_map { __u32 max_entries; } __attribute__((preserve_access_index)); struct bpf_array { struct bpf_map map; __u32 elem_size; } __attribute__((preserve_access_index)); struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 4); __type(key, __u32); __type(value, __u32); } m_array SEC(".maps"); SEC("cgroup_skb/egress") int cg_skb(void *ctx) { struct bpf_array *array = (struct bpf_array *)&m_array; /* .. array->map.max_entries .. */ } In kernel, bpf_htab has similar structure, struct bpf_htab { struct bpf_map map; ... } In the above cg_skb(), to access array->map.max_entries, with CORE, the clang will generate two builtin's. base = &m_array; /* access array.map */ map_addr = __builtin_preserve_struct_access_info(base, 0, 0); /* access array.map.max_entries */ max_entries_addr = __builtin_preserve_struct_access_info(map_addr, 0, 0); max_entries = *max_entries_addr; In the current llvm, if two builtin's are in the same function or in the same function after inlining, the compiler is smart enough to chain them together and generates like below: base = &m_array; max_entries = *(base + reloc_offset); /* reloc_offset = 0 in this case */ and we are fine. But if we force no inlining for one of functions in test_map_ptr() selftest, e.g., check_default(), the above two __builtin_preserve_* will be in two different functions. In this case, we will have code like: func check_hash(): reloc_offset_map = 0; base = &m_array; map_base = base + reloc_offset_map; check_default(map_base, ...) func check_default(map_base, ...): max_entries = *(map_base + reloc_offset_max_entries); In kernel, map_ptr (CONST_PTR_TO_MAP) does not allow any arithmetic. The above "map_base = base + reloc_offset_map" will trigger a verifier failure. ; VERIFY(check_default(&hash->map, map)); 0: (18) r7 = 0xffffb4fe8018a004 2: (b4) w1 = 110 3: (63) *(u32 *)(r7 +0) = r1 R1_w=invP110 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0 ; VERIFY_TYPE(BPF_MAP_TYPE_HASH, check_hash); 4: (18) r1 = 0xffffb4fe8018a000 6: (b4) w2 = 1 7: (63) *(u32 *)(r1 +0) = r2 R1_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R2_w=invP1 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0 8: (b7) r2 = 0 9: (18) r8 = 0xffff90bcb500c000 11: (18) r1 = 0xffff90bcb500c000 13: (0f) r1 += r2 R1 pointer arithmetic on map_ptr prohibited To fix the issue, let us permit map_ptr + 0 arithmetic which will result in exactly the same map_ptr. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200908175702.2463625-1-yhs@fb.com
2020-09-08seccomp: Use current_pt_regs() instead of task_pt_regs(current)Denis Efremov
As described in commit a3460a59747c ("new helper: current_pt_regs()"): - arch versions are "optimized versions". - some architectures have task_pt_regs() working only for traced tasks blocked on signal delivery. current_pt_regs() needs to work for *all* processes. In preparation for adding a coccinelle rule for using current_*(), instead of raw accesses to current members, modify seccomp_do_user_notification(), __seccomp_filter(), __secure_computing() to use current_pt_regs(). Signed-off-by: Denis Efremov <efremov@linux.com> Link: https://lore.kernel.org/r/20200824125921.488311-1-efremov@linux.com [kees: Reworded commit log, add comment to populate_seccomp_data()] Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08seccomp: kill process instead of thread for unknown actionsRich Felker
Asynchronous termination of a thread outside of the userspace thread library's knowledge is an unsafe operation that leaves the process in an inconsistent, corrupt, and possibly unrecoverable state. In order to make new actions that may be added in the future safe on kernels not aware of them, change the default action from SECCOMP_RET_KILL_THREAD to SECCOMP_RET_KILL_PROCESS. Signed-off-by: Rich Felker <dalias@libc.org> Link: https://lore.kernel.org/r/20200829015609.GA32566@brightrain.aerifal.cx [kees: Fixed up coredump selection logic to match] Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08seccomp: don't leave dangling ->notif if file allocation failsTycho Andersen
Christian and Kees both pointed out that this is a bit sloppy to open-code both places, and Christian points out that we leave a dangling pointer to ->notif if file allocation fails. Since we check ->notif for null in order to determine if it's ok to install a filter, this means people won't be able to install a filter if the file allocation fails for some reason, even if they subsequently should be able to. To fix this, let's hoist this free+null into its own little helper and use it. Reported-by: Kees Cook <keescook@chromium.org> Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200902140953.1201956-1-tycho@tycho.pizza Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08seccomp: don't leak memory when filter install racesTycho Andersen
In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize the listener fd, then check to see if we can actually use it later in seccomp_may_assign_mode(), which can fail if anyone else in our thread group has installed a filter and caused some divergence. If we can't, we partially clean up the newly allocated file: we put the fd, put the file, but don't actually clean up the *memory* that was allocated at filter->notif. Let's clean that up too. To accomplish this, let's hoist the actual "detach a notifier from a filter" code to its own helper out of seccomp_notify_release(), so that in case anyone adds stuff to init_listener(), they only have to add the cleanup code in one spot. This does a bit of extra locking and such on the failure path when the filter is not attached, but it's a slow failure path anyway. Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together") Reported-by: syzbot+3ad9614a12f80994c32e@syzkaller.appspotmail.com Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200902014017.934315-1-tycho@tycho.pizza Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08kdb: Use newer api for tasklist scanningDavidlohr Bueso
This kills using the do_each_thread/while_each_thread combo to iterate all threads and uses for_each_process_thread() instead, maintaining semantics. while_each_thread() is ultimately racy and deprecated; although in this particular case there is no concurrency so it doesn't matter. Still lets trivially get rid of two more users. Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lore.kernel.org/r/20200907203206.21293-1-dave@stgolabs.net Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-09-08kgdb: Make "kgdbcon" work properly with "kgdb_earlycon"Douglas Anderson
On my system the kernel processes the "kgdb_earlycon" parameter before the "kgdbcon" parameter. When we setup "kgdb_earlycon" we'll end up in kgdb_register_callbacks() and "kgdb_use_con" won't have been set yet so we'll never get around to starting "kgdbcon". Let's remedy this by detecting that the IO module was already registered when setting "kgdb_use_con" and registering the console then. As part of this, to avoid pre-declaring things, move the handling of the "kgdbcon" further down in the file. Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20200630151422.1.I4aa062751ff5e281f5116655c976dff545c09a46@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-09-08kdb: remove unnecessary null check of dbg_io_opsCengiz Can
`kdb_msg_write` operates on a global `struct kgdb_io *` called `dbg_io_ops`. It's initialized in `debug_core.c` and checked throughout the debug flow. There's a null check in `kdb_msg_write` which triggers static analyzers and gives the (almost entirely wrong) impression that it can be null. Coverity scanner caught this as CID 1465042. I have removed the unnecessary null check and eliminated false-positive forward null dereference warning. Signed-off-by: Cengiz Can <cengiz@kernel.wtf> Link: https://lore.kernel.org/r/20200630082922.28672-1-cengiz@kernel.wtf Reviewed-by: Sumit Garg <sumit.garg@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-09-08kprobes: Make local functions staticMasami Hiramatsu
Since we unified the kretprobe trampoline handler from arch/* code, some functions and objects do not need to be exported anymore. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/159870618256.1229682.8692046612635810882.stgit@devnote2
2020-09-08kprobes: Free kretprobe_instance with RCU callbackMasami Hiramatsu
Free kretprobe_instance with RCU callback instead of directly freeing the object in the kretprobe handler context. This will make kretprobe run safer in NMI context. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/159870616685.1229682.11978742048709542226.stgit@devnote2
2020-09-08kprobes: Remove NMI context checkMasami Hiramatsu
The in_nmi() check in pre_handler_kretprobe() is meant to avoid recursion, and blindly assumes that anything NMI is recursive. However, since commit: 9b38cc704e84 ("kretprobe: Prevent triggering kretprobe from within kprobe_flush_task") there is a better way to detect and avoid actual recursion. By setting a dummy kprobe, any actual exceptions will terminate early (by trying to handle the dummy kprobe), and recursion will not happen. Employ this to avoid the kretprobe_table_lock() recursion, replacing the over-eager in_nmi() check. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/159870615628.1229682.6087311596892125907.stgit@devnote2