summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-10-01Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2021-10-02 We've added 85 non-merge commits during the last 15 day(s) which contain a total of 132 files changed, 13779 insertions(+), 6724 deletions(-). The main changes are: 1) Massive update on test_bpf.ko coverage for JITs as preparatory work for an upcoming MIPS eBPF JIT, from Johan Almbladh. 2) Add a batched interface for RX buffer allocation in AF_XDP buffer pool, with driver support for i40e and ice from Magnus Karlsson. 3) Add legacy uprobe support to libbpf to complement recently merged legacy kprobe support, from Andrii Nakryiko. 4) Add bpf_trace_vprintk() as variadic printk helper, from Dave Marchevsky. 5) Support saving the register state in verifier when spilling <8byte bounded scalar to the stack, from Martin Lau. 6) Add libbpf opt-in for stricter BPF program section name handling as part of libbpf 1.0 effort, from Andrii Nakryiko. 7) Add a document to help clarifying BPF licensing, from Alexei Starovoitov. 8) Fix skel_internal.h to propagate errno if the loader indicates an internal error, from Kumar Kartikeya Dwivedi. 9) Fix build warnings with -Wcast-function-type so that the option can later be enabled by default for the kernel, from Kees Cook. 10) Fix libbpf to ignore STT_SECTION symbols in legacy map definitions as it otherwise errors out when encountering them, from Toke Høiland-Jørgensen. 11) Teach libbpf to recognize specialized maps (such as for perf RB) and internally remove BTF type IDs when creating them, from Hengqi Chen. 12) Various fixes and improvements to BPF selftests. ==================== Link: https://lore.kernel.org/r/20211002001327.15169-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-01audit: add support for the openat2 syscallRichard Guy Briggs
The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9 ("open: introduce openat2(2) syscall"). Add the openat2(2) syscall to the audit syscall classifier. Link: https://github.com/linux-audit/audit-kernel/issues/67 Link: https://lore.kernel.org/r/f5f1a4d8699613f8c02ce762807228c841c2e26f.1621363275.git.rgb@redhat.com Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [PM: merge fuzz due to previous header rename, commit line wraps] Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-01audit: replace magic audit syscall class numbers with macrosRichard Guy Briggs
Replace audit syscall class magic numbers with macros. This required putting the macros into new header file include/linux/audit_arch.h since the syscall macros were included for both 64 bit and 32 bit in any compat code, causing redefinition warnings. Link: https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git.rgb@redhat.com Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [PM: renamed header to audit_arch.h after consulting with Richard] Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-01dma-mapping: remove bogus test for pfn_valid from dma_map_resourceMike Rapoport
dma_map_resource() uses pfn_valid() to ensure the range is not RAM. However, pfn_valid() only checks for availability of the memory map for a PFN but it does not ensure that the PFN is actually backed by RAM. As dma_map_resource() is the only method in DMA mapping APIs that has this check, simply drop the pfn_valid() test from dma_map_resource(). Link: https://lore.kernel.org/all/20210824173741.GC623@arm.com/ Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210930013039.11260-2-rppt@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2021-10-01sched/fair: Null terminate buffer when updating tunable_scalingMel Gorman
This patch null-terminates the temporary buffer in sched_scaling_write() so kstrtouint() does not return failure and checks the value is valid. Before: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument $ cat /sys/kernel/debug/sched/tunable_scaling 1 After: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling $ cat /sys/kernel/debug/sched/tunable_scaling 0 $ echo 3 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
2021-10-01sched/fair: Add ancestors of unthrottled undecayed cfs_rqMichal Koutný
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") we add cfs_rqs with no runnable tasks but not fully decayed into the load (leaf) list. We may ignore adding some ancestors and therefore breaking tmp_alone_branch invariant. This broke LTP test cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling"). I noticed the named test still fails even with the fix (but with low probability, 1 in ~1000 executions of the test). The reason is when bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly. Fix this by adding ancestors if we notice the unthrottled cfs_rq was added to the load list. Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
2021-10-01perf/core: fix userpage->time_enabled of inactive eventsSong Liu
Users of rdpmc rely on the mmapped user page to calculate accurate time_enabled. Currently, userpage->time_enabled is only updated when the event is added to the pmu. As a result, inactive event (due to counter multiplexing) does not have accurate userpage->time_enabled. This can be reproduced with something like: /* open 20 task perf_event "cycles", to create multiplexing */ fd = perf_event_open(); /* open task perf_event "cycles" */ userpage = mmap(fd); /* use mmap and rdmpc */ while (true) { time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */ time_enabled_read = read(fd).time_enabled; if (time_enabled_mmap > time_enabled_read) BUG(); } Fix this by updating userpage for inactive events in merge_sched_in. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-and-tested-by: Lucian Grijincu <lucian@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210929194313.2398474-1-songliubraving@fb.com
2021-10-01rtmutex: Wake up the waiters lockless while dropping the read lock.Thomas Gleixner
The rw_semaphore and rwlock_t implementation both wake the waiter while holding the rt_mutex_base::wait_lock acquired. This can be optimized by waking the waiter lockless outside of the locked section to avoid a needless contention on the rt_mutex_base::wait_lock lock. Extend rt_mutex_wake_q_add() to also accept task and state and use it in __rwbase_read_unlock(). Suggested-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928150006.597310-3-bigeasy@linutronix.de
2021-10-01rtmutex: Check explicit for TASK_RTLOCK_WAIT.Sebastian Andrzej Siewior
rt_mutex_wake_q_add() needs to need to distiguish between sleeping locks (TASK_RTLOCK_WAIT) and normal locks which use TASK_NORMAL to use the proper wake mechanism. Instead of checking for != TASK_NORMAL make it more robust and check explicit for TASK_RTLOCK_WAIT which is the reason why a different wake mechanism is used. No functional change. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928150006.597310-2-bigeasy@linutronix.de
2021-10-01locking/rt: Take RCU nesting into account for __might_resched()Thomas Gleixner
The general rule that rcu_read_lock() held sections cannot voluntary sleep does apply even on RT kernels. Though the substitution of spin/rw locks on RT enabled kernels has to be exempt from that rule. On !RT a spin_lock() can obviously nest inside a RCU read side critical section as the lock acquisition is not going to block, but on RT this is not longer the case due to the 'sleeping' spinlock substitution. The RT patches contained a cheap hack to ignore the RCU nesting depth in might_sleep() checks, which was a pragmatic but incorrect workaround. Instead of generally ignoring the RCU nesting depth in __might_sleep() and __might_resched() checks, pass the rcu_preempt_depth() via the offsets argument to __might_resched() from spin/read/write_lock() which makes the checks work correctly even in RCU read side critical sections. The actual blocking on such a substituted lock within a RCU read side critical section is already handled correctly in __schedule() by treating it as a "preemption" of the RCU read side critical section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.368305497@linutronix.de
2021-10-01sched: Make RCU nest depth distinct in __might_resched()Thomas Gleixner
For !RT kernels RCU nest depth in __might_resched() is always expected to be 0, but on RT kernels it can be non zero while the preempt count is expected to be always 0. Instead of playing magic games in interpreting the 'preempt_offset' argument, rename it to 'offsets' and use the lower 8 bits for the expected preempt count, allow to hand in the expected RCU nest depth in the upper bits and adopt the __might_resched() code and related checks and printks. The affected call sites are updated in subsequent steps. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.243232823@linutronix.de
2021-10-01sched: Make might_sleep() output less confusingThomas Gleixner
might_sleep() output is pretty informative, but can be confusing at times especially with PREEMPT_RCU when the check triggers due to a voluntary sleep inside a RCU read side critical section: BUG: sleeping function called from invalid context at kernel/test.c:110 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 Preemption disabled at: migrate_disable+0x33/0xa0 in_atomic() is 0, but it still tells that preemption was disabled at migrate_disable(), which is completely useless because preemption is not disabled. But the interesting information to decode the above, i.e. the RCU nesting depth, is not printed. That becomes even more confusing when might_sleep() is invoked from cond_resched_lock() within a RCU read side critical section. Here the expected preemption count is 1 and not 0. BUG: sleeping function called from invalid context at kernel/test.c:131 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 Preemption disabled at: test_cond_lock+0xf3/0x1c0 So in_atomic() is set, which is expected as the caller holds a spinlock, but it's unclear why this is broken and the preempt disable IP is just pointing at the correct place, i.e. spin_lock(), which is obviously not helpful either. Make that more useful in general: - Print preempt_count() and the expected value and for the CONFIG_PREEMPT_RCU case: - Print the RCU read side critical section nesting depth - Print the preempt disable IP only when preempt count does not have the expected value. So the might_sleep() dump from a within a preemptible RCU read side critical section becomes: BUG: sleeping function called from invalid context at kernel/test.c:110 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 and the cond_resched_lock() case becomes: BUG: sleeping function called from invalid context at kernel/test.c:141 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52 preempt_count: 1, expected: 1 RCU nest depth: 1, expected: 0 which makes is pretty obvious what's going on. For all other cases the preempt disable IP is still printed as before: BUG: sleeping function called from invalid context at kernel/test.c: 156 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 Preemption disabled at: [<ffffffff82b48326>] test_might_sleep+0xbe/0xf8 BUG: sleeping function called from invalid context at kernel/test.c: 163 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 1, expected: 0 RCU nest depth: 1, expected: 0 Preemption disabled at: [<ffffffff82b48326>] test_might_sleep+0x1e4/0x280 This also prepares to provide a better debugging output for RT enabled kernels and their spinlock substitutions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.181022656@linutronix.de
2021-10-01sched: Cleanup might_sleep() printksThomas Gleixner
Convert them to pr_*(). No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.117496067@linutronix.de
2021-10-01sched: Remove preempt_offset argument from __might_sleep()Thomas Gleixner
All callers hand in 0 and never will hand in anything else. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165358.054321586@linutronix.de
2021-10-01sched: Clean up the might_sleep() underscore zooThomas Gleixner
__might_sleep() vs. ___might_sleep() is hard to distinguish. Aside of that the three underscore variant is exposed to provide a checkpoint for rescheduling points which are distinct from blocking points. They are semantically a preemption point which means that scheduling is state preserving. A real blocking operation, e.g. mutex_lock(), wait*(), which cannot preserve a task state which is not equal to RUNNING. While technically blocking on a "sleeping" spinlock in RT enabled kernels falls into the voluntary scheduling category because it has to wait until the contended spin/rw lock becomes available, the RT lock substitution code can semantically be mapped to a voluntary preemption because the RT lock substitution code and the scheduler are providing mechanisms to preserve the task state and to take regular non-lock related wakeups into account. Rename ___might_sleep() to __might_resched() to make the distinction of these functions clear. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210923165357.928693482@linutronix.de
2021-10-01locking/ww-mutex: Fix uninitialized use of ret in test_aa()Nathan Chancellor
Clang warns: kernel/locking/test-ww_mutex.c:138:7: error: variable 'ret' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized] if (!ww_mutex_trylock(&mutex, &ctx)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/locking/test-ww_mutex.c:172:9: note: uninitialized use occurs here return ret; ^~~ kernel/locking/test-ww_mutex.c:138:3: note: remove the 'if' if its condition is always false if (!ww_mutex_trylock(&mutex, &ctx)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/locking/test-ww_mutex.c:125:9: note: initialize the variable 'ret' to silence this warning int ret; ^ = 0 1 error generated. Assign !ww_mutex_trylock(...) to ret so that it is always initialized. Fixes: 12235da8c80a ("kernel/locking: Add context to ww_mutex_trylock()") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20210922145822.3935141-1-nathan@kernel.org
2021-09-30x86/kprobes: Fixup return address in generic trampoline handlerMasami Hiramatsu
In x86, the fake return address on the stack saved by __kretprobe_trampoline() will be replaced with the real return address after returning from trampoline_handler(). Before fixing the return address, the real return address can be found in the 'current->kretprobe_instances'. However, since there is a window between updating the 'current->kretprobe_instances' and fixing the address on the stack, if an interrupt happens at that timing and the interrupt handler does stacktrace, it may fail to unwind because it can not get the correct return address from 'current->kretprobe_instances'. This will eliminate that window by fixing the return address right before updating 'current->kretprobe_instances'. Link: https://lkml.kernel.org/r/163163057094.489837.9044470370440745866.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30tracing: Show kretprobe unknown indicator only for kretprobe_trampolineMasami Hiramatsu
ftrace shows "[unknown/kretprobe'd]" indicator all addresses in the kretprobe_trampoline, but the modified address by kretprobe should be only kretprobe_trampoline+0. Link: https://lkml.kernel.org/r/163163056044.489837.794883849706638013.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Enable stacktrace from pt_regs in kretprobe handlerMasami Hiramatsu
Since the ORC unwinder from pt_regs requires setting up regs->ip correctly, set the correct return address to the regs->ip before calling user kretprobe handler. This allows the kretrprobe handler to trace stack from the kretprobe's pt_regs by stack_trace_save_regs() (eBPF will do this), instead of stack tracing from the handler context by stack_trace_save() (ftrace will do this). Link: https://lkml.kernel.org/r/163163053237.489837.4272653874525136832.stgit@devnote2 Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Add kretprobe_find_ret_addr() for searching return addressMasami Hiramatsu
Introduce kretprobe_find_ret_addr() and is_kretprobe_trampoline(). These APIs will be used by the ORC stack unwinder and ftrace, so that they can check whether the given address points kretprobe trampoline code and query the correct return address in that case. Link: https://lkml.kernel.org/r/163163046461.489837.1044778356430293962.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: treewide: Make it harder to refer kretprobe_trampoline directlyMasami Hiramatsu
Since now there is kretprobe_trampoline_addr() for referring the address of kretprobe trampoline code, we don't need to access kretprobe_trampoline directly. Make it harder to refer by renaming it to __kretprobe_trampoline(). Link: https://lkml.kernel.org/r/163163045446.489837.14510577516938803097.stgit@devnote2 Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: treewide: Remove trampoline_address from kretprobe_trampoline_handler()Masami Hiramatsu
The __kretprobe_trampoline_handler() callback, called from low level arch kprobes methods, has the 'trampoline_address' parameter, which is entirely superfluous as it basically just replicates: dereference_kernel_function_descriptor(kretprobe_trampoline) In fact we had bugs in arch code where it wasn't replicated correctly. So remove this superfluous parameter and use kretprobe_trampoline_addr() instead. Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: treewide: Replace arch_deref_entry_point() with ↵Masami Hiramatsu
dereference_symbol_descriptor() ~15 years ago kprobes grew the 'arch_deref_entry_point()' __weak function: 3d7e33825d87: ("jprobes: make jprobes a little safer for users") But this is just open-coded dereference_symbol_descriptor() in essence, and its obscure nature was causing bugs. Just use the real thing and remove arch_deref_entry_point(). Link: https://lkml.kernel.org/r/163163043630.489837.7924988885652708696.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Use bool type for functions which returns boolean valueMasami Hiramatsu
Use the 'bool' type instead of 'int' for the functions which returns a boolean value, because this makes clear that those functions don't return any error code. Link: https://lkml.kernel.org/r/163163041649.489837.17311187321419747536.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: treewide: Use 'kprobe_opcode_t *' for the code address in ↵Masami Hiramatsu
get_optimized_kprobe() Since get_optimized_kprobe() is only used inside kprobes, it doesn't need to use 'unsigned long' type for 'addr' parameter. Make it use 'kprobe_opcode_t *' for the 'addr' parameter and subsequent call of arch_within_optimized_kprobe() also should use 'kprobe_opcode_t *'. Note that MAX_OPTIMIZED_LENGTH and RELATIVEJUMP_SIZE are defined by byte-size, but the size of 'kprobe_opcode_t' depends on the architecture. Therefore, we must be careful when calculating addresses using those macros. Link: https://lkml.kernel.org/r/163163040680.489837.12133032364499833736.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Add assertions for required lockMasami Hiramatsu
Add assertions for required locks instead of comment it so that the lockdep can inspect locks automatically. Link: https://lkml.kernel.org/r/163163039572.489837.18011973177537476885.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Fix coding style issuesMasami Hiramatsu
Fix coding style issues reported by checkpatch.pl and update comments to quote variable names and add "()" to function name. One TODO comment in __disarm_kprobe() is removed because it has been done by following commit. Link: https://lkml.kernel.org/r/163163037468.489837.4282347782492003960.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: treewide: Cleanup the error messages for kprobesMasami Hiramatsu
This clean up the error/notification messages in kprobes related code. Basically this defines 'pr_fmt()' macros for each files and update the messages which describes - what happened, - what is the kernel going to do or not do, - is the kernel fine, - what can the user do about it. Also, if the message is not needed (e.g. the function returns unique error code, or other error message is already shown.) remove it, and replace the message with WARN_*() macros if suitable. Link: https://lkml.kernel.org/r/163163036568.489837.14085396178727185469.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Make arch_check_ftrace_location staticPunit Agrawal
arch_check_ftrace_location() was introduced as a weak function in commit f7f242ff004499 ("kprobes: introduce weak arch_check_ftrace_location() helper function") to allow architectures to handle kprobes call site on their own. Recently, the only architecture (csky) to implement arch_check_ftrace_location() was migrated to using the common version. As a result, further cleanup the code to drop the weak attribute and rename the function to remove the architecture specific implementation. Link: https://lkml.kernel.org/r/163163035673.489837.2367816318195254104.stgit@devnote2 Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobe: Simplify prepare_kprobe() by dropping redundant versionPunit Agrawal
The function prepare_kprobe() is called during kprobe registration and is responsible for ensuring any architecture related preparation for the kprobe is done before returning. One of two versions of prepare_kprobe() is chosen depending on the availability of KPROBE_ON_FTRACE in the kernel configuration. Simplify the code by dropping the version when KPROBE_ON_FTRACE is not selected - instead relying on kprobe_ftrace() to return false when KPROBE_ON_FTRACE is not set. No functional change. Link: https://lkml.kernel.org/r/163163033696.489837.9264661820279300788.stgit@devnote2 Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Use helper to parse boolean input from userspacePunit Agrawal
The "enabled" file provides a debugfs interface to arm / disarm kprobes in the kernel. In order to parse the buffer containing the values written from userspace, the callback manually parses the user input to convert it to a boolean value. As taking a string value from userspace and converting it to boolean is a common operation, a helper kstrtobool_from_user() is already available in the kernel. Update the callback to use the common helper to parse the write buffer from userspace. Link: https://lkml.kernel.org/r/163163032637.489837.10678039554832855327.stgit@devnote2 Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30kprobes: Do not use local variable when creating debugfs filePunit Agrawal
debugfs_create_file() takes a pointer argument that can be used during file operation callbacks (accessible via i_private in the inode structure). An obvious requirement is for the pointer to refer to valid memory when used. When creating the debugfs file to dynamically enable / disable kprobes, a pointer to local variable is passed to debugfs_create_file(); which will go out of scope when the init function returns. The reason this hasn't triggered random memory corruption is because the pointer is not accessed during the debugfs file callbacks. Since the enabled state is managed by the kprobes_all_disabled global variable, the local variable is not needed. Fix the incorrect (and unnecessary) usage of local variable during debugfs_file_create() by passing NULL instead. Link: https://lkml.kernel.org/r/163163031686.489837.4476867635937014973.stgit@devnote2 Fixes: bf8f6e5b3e51 ("Kprobes: The ON/OFF knob thru debugfs") Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/phy/bcm7xxx.c d88fd1b546ff ("net: phy: bcm7xxx: Fixed indirect MMD operations") f68d08c437f9 ("net: phy: bcm7xxx: Add EPHY entry for 72165") net/sched/sch_api.c b193e15ac69d ("net: prevent user from passing illegal stab size") 69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers") Both cases trivial - adjacent code additions. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-30Merge tag 'net-5.15-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from mac80211, netfilter and bpf. Current release - regressions: - bpf, cgroup: assign cgroup in cgroup_sk_alloc when called from interrupt - mdio: revert mechanical patches which broke handling of optional resources - dev_addr_list: prevent address duplication Previous releases - regressions: - sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb (NULL deref) - Revert "mac80211: do not use low data rates for data frames with no ack flag", fixing broadcast transmissions - mac80211: fix use-after-free in CCMP/GCMP RX - netfilter: include zone id in tuple hash again, minimize collisions - netfilter: nf_tables: unlink table before deleting it (race -> UAF) - netfilter: log: work around missing softdep backend module - mptcp: don't return sockets in foreign netns - sched: flower: protect fl_walk() with rcu (race -> UAF) - ixgbe: fix NULL pointer dereference in ixgbe_xdp_setup - smsc95xx: fix stalled rx after link change - enetc: fix the incorrect clearing of IF_MODE bits - ipv4: fix rtnexthop len when RTA_FLOW is present - dsa: mv88e6xxx: 6161: use correct MAX MTU config method for this SKU - e100: fix length calculation & buffer overrun in ethtool::get_regs Previous releases - always broken: - mac80211: fix using stale frag_tail skb pointer in A-MSDU tx - mac80211: drop frames from invalid MAC address in ad-hoc mode - af_unix: fix races in sk_peer_pid and sk_peer_cred accesses (race -> UAF) - bpf, x86: Fix bpf mapping of atomic fetch implementation - bpf: handle return value of BPF_PROG_TYPE_STRUCT_OPS prog - netfilter: ip6_tables: zero-initialize fragment offset - mhi: fix error path in mhi_net_newlink - af_unix: return errno instead of NULL in unix_create1() when over the fs.file-max limit Misc: - bpf: exempt CAP_BPF from checks against bpf_jit_limit - netfilter: conntrack: make max chain length random, prevent guessing buckets by attackers - netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic, defer conntrack walk to work queue (prevent hogging RTNL lock)" * tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits) af_unix: fix races in sk_peer_pid and sk_peer_cred accesses net: stmmac: fix EEE init issue when paired with EEE capable PHYs net: dev_addr_list: handle first address in __hw_addr_add_ex net: sched: flower: protect fl_walk() with rcu net: introduce and use lock_sock_fast_nested() net: phy: bcm7xxx: Fixed indirect MMD operations net: hns3: disable firmware compatible features when uninstall PF net: hns3: fix always enable rx vlan filter problem after selftest net: hns3: PF enable promisc for VF when mac table is overflow net: hns3: fix show wrong state when add existing uc mac address net: hns3: fix mixed flag HCLGE_FLAG_MQPRIO_ENABLE and HCLGE_FLAG_DCB_ENABLE net: hns3: don't rollback when destroy mqprio fail net: hns3: remove tc enable checking net: hns3: do not allow call hns3_nic_net_open repeatedly ixgbe: Fix NULL pointer dereference in ixgbe_xdp_setup net: bridge: mcast: Associate the seqcount with its protecting lock. net: mdio-ipq4019: Fix the error for an optional regs resource net: hns3: fix hclge_dbg_dump_tm_pg() stack usage net: mdio: mscc-miim: Fix the mdio controller af_unix: Return errno instead of NULL in unix_create1(). ...
2021-09-30bpf: Fix integer overflow in prealloc_elems_and_freelist()Tatsuhiko Yasumatsu
In prealloc_elems_and_freelist(), the multiplication to calculate the size passed to bpf_map_area_alloc() could lead to an integer overflow. As a result, out-of-bounds write could occur in pcpu_freelist_populate() as reported by KASAN: [...] [ 16.968613] BUG: KASAN: slab-out-of-bounds in pcpu_freelist_populate+0xd9/0x100 [ 16.969408] Write of size 8 at addr ffff888104fc6ea0 by task crash/78 [ 16.970038] [ 16.970195] CPU: 0 PID: 78 Comm: crash Not tainted 5.15.0-rc2+ #1 [ 16.970878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 16.972026] Call Trace: [ 16.972306] dump_stack_lvl+0x34/0x44 [ 16.972687] print_address_description.constprop.0+0x21/0x140 [ 16.973297] ? pcpu_freelist_populate+0xd9/0x100 [ 16.973777] ? pcpu_freelist_populate+0xd9/0x100 [ 16.974257] kasan_report.cold+0x7f/0x11b [ 16.974681] ? pcpu_freelist_populate+0xd9/0x100 [ 16.975190] pcpu_freelist_populate+0xd9/0x100 [ 16.975669] stack_map_alloc+0x209/0x2a0 [ 16.976106] __sys_bpf+0xd83/0x2ce0 [...] The possibility of this overflow was originally discussed in [0], but was overlooked. Fix the integer overflow by changing elem_size to u64 from u32. [0] https://lore.kernel.org/bpf/728b238e-a481-eb50-98e9-b0f430ab01e7@gmail.com/ Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210930135545.173698-1-th.yasumatsu@gmail.com
2021-09-30sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=yArd Biesheuvel
THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this causes some issues on architectures that define raw_smp_processor_id() in terms of this field, due to the fact that #include'ing linux/sched.h to get at struct task_struct is problematic in terms of circular dependencies. Given that thread_info and task_struct are the same data structure anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having access to the type definition of struct thread_info is sufficient to reference the CPU number of the current task. Note that this requires THREAD_INFO_IN_TASK's definition of the task_thread_info() helper to be updated, as task_cpu() takes a pointer-to-const, whereas task_thread_info() (which is used to generate lvalues as well), needs a non-const pointer. So make it a macro instead. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2021-09-30scs: Release kasan vmalloc poison in scs_free processYee Lee
Since scs allocation is moved to vmalloc region, the shadow stack is protected by kasan_posion_vmalloc. However, the vfree_atomic operation needs to access its context for scs_free process and causes kasan error as the dump info below. This patch Adds kasan_unpoison_vmalloc() before vfree_atomic, which aligns to the prior flow as using kmem_cache. The vmalloc region will go back posioned in the following vumap() operations. ================================================================== BUG: KASAN: vmalloc-out-of-bounds in llist_add_batch+0x60/0xd4 Write of size 8 at addr ffff8000100b9000 by task kthreadd/2 CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.15.0-rc2-11681-g92477dd1faa6-dirty #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x43c show_stack+0x1c/0x2c dump_stack_lvl+0x68/0x84 print_address_description+0x80/0x394 kasan_report+0x180/0x1dc __asan_report_store8_noabort+0x48/0x58 llist_add_batch+0x60/0xd4 vfree_atomic+0x60/0xe0 scs_free+0x1dc/0x1fc scs_release+0xa4/0xd4 free_task+0x30/0xe4 __put_task_struct+0x1ec/0x2e0 delayed_put_task_struct+0x5c/0xa0 rcu_do_batch+0x62c/0x8a0 rcu_core+0x60c/0xc14 rcu_core_si+0x14/0x24 __do_softirq+0x19c/0x68c irq_exit+0x118/0x2dc handle_domain_irq+0xcc/0x134 gic_handle_irq+0x7c/0x1bc call_on_irq_stack+0x40/0x70 do_interrupt_handler+0x78/0x9c el1_interrupt+0x34/0x60 el1h_64_irq_handler+0x1c/0x2c el1h_64_irq+0x78/0x7c _raw_spin_unlock_irqrestore+0x40/0xcc sched_fork+0x4f0/0xb00 copy_process+0xacc/0x3648 kernel_clone+0x168/0x534 kernel_thread+0x13c/0x1b0 kthreadd+0x2bc/0x400 ret_from_fork+0x10/0x20 Memory state around the buggy address: ffff8000100b8f00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffff8000100b8f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 >ffff8000100b9000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ^ ffff8000100b9080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffff8000100b9100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ================================================================== Suggested-by: Kuan-Ying Lee <kuan-ying.lee@mediatek.com> Acked-by: Will Deacon <will@kernel.org> Tested-by: Will Deacon <will@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Fixes: a2abe7cbd8fe ("scs: switch to vmapped shadow stacks") Link: https://lore.kernel.org/r/20210930081619.30091-1-yee.lee@mediatek.com Signed-off-by: Will Deacon <will@kernel.org>
2021-09-29uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argumentEugene Syromiatnikov
Commit 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") made use of enum pid_type in prctl's arg4; this type and the associated enumeration definitions are not exposed to userspace. Christian has suggested to provide additional macro definitions that convey the meaning of the type argument more in alignment with its actual usage, and this patch does exactly that. Link: https://lore.kernel.org/r/20210825170613.GA3884@asgard.redhat.com Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Complements: 7ac592aa35a684ff ("sched: prctl() core-scheduling interface") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-09-29swiotlb: Support aligned swiotlb buffersDavid Stevens
Add an argument to swiotlb_tbl_map_single that specifies the desired alignment of the allocated buffer. This is used by dma-iommu to ensure the buffer is aligned to the iova granule size when using swiotlb with untrusted sub-granule mappings. This addresses an issue where adjacent slots could be exposed to the untrusted device if IO_TLB_SIZE < iova granule < PAGE_SIZE. Signed-off-by: David Stevens <stevensd@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210929023300.335969-7-stevensd@google.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-09-28bpf: Replace callers of BPF_CAST_CALL with proper function typedefKees Cook
In order to keep ahead of cases in the kernel where Control Flow Integrity (CFI) may trip over function call casts, enabling -Wcast-function-type is helpful. To that end, BPF_CAST_CALL causes various warnings and is one of the last places in the kernel triggering this warning. For actual function calls, replace BPF_CAST_CALL() with a typedef, which captures the same details about the given function pointers. This change results in no object code difference. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://github.com/KSPP/linux/issues/20 Link: https://lore.kernel.org/lkml/CAEf4Bzb46=-J5Fxc3mMZ8JQPtK1uoE0q6+g6WPz53Cvx=CBEhw@mail.gmail.com Link: https://lore.kernel.org/bpf/20210928230946.4062144-3-keescook@chromium.org
2021-09-28bpf: Replace "want address" users of BPF_CAST_CALL with BPF_CALL_IMMKees Cook
In order to keep ahead of cases in the kernel where Control Flow Integrity (CFI) may trip over function call casts, enabling -Wcast-function-type is helpful. To that end, BPF_CAST_CALL causes various warnings and is one of the last places in the kernel triggering this warning. Most places using BPF_CAST_CALL actually just want a void * to perform math on. It's not actually performing a call, so just use a different helper to get the void *, by way of the new BPF_CALL_IMM() helper, which can clean up a common copy/paste idiom as well. This change results in no object code difference. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://github.com/KSPP/linux/issues/20 Link: https://lore.kernel.org/lkml/CAEf4Bzb46=-J5Fxc3mMZ8JQPtK1uoE0q6+g6WPz53Cvx=CBEhw@mail.gmail.com Link: https://lore.kernel.org/bpf/20210928230946.4062144-2-keescook@chromium.org
2021-09-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-09-28 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 14 day(s) which contain a total of 11 files changed, 139 insertions(+), 53 deletions(-). The main changes are: 1) Fix MIPS JIT jump code emission for too large offsets, from Piotr Krysiuk. 2) Fix x86 JIT atomic/fetch emission when dst reg maps to rax, from Johan Almbladh. 3) Fix cgroup_sk_alloc corner case when called from interrupt, from Daniel Borkmann. 4) Fix segfault in libbpf's linker for objects without BTF, from Kumar Kartikeya Dwivedi. 5) Fix bpf_jit_charge_modmem for applications with CAP_BPF, from Lorenz Bauer. 6) Fix return value handling for struct_ops BPF programs, from Hou Tao. 7) Various fixes to BPF selftests, from Jiri Benc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> ,
2021-09-28module: fix clang CFI with MODULE_UNLOAD=nArnd Bergmann
When CONFIG_MODULE_UNLOAD is disabled, the module->exit member is not defined, causing a build failure: kernel/module.c:4493:8: error: no member named 'exit' in 'struct module' mod->exit = *exit; add an #ifdef block around this. Fixes: cf68fffb66d6 ("add support for Clang CFI") Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2021-09-28bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interruptDaniel Borkmann
If cgroup_sk_alloc() is called from interrupt context, then just assign the root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather than re-adding the NULL-test to the fast-path we can just assign it once from cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp directly does /not/ change behavior for callers of sock_cgroup_ptr(). syzkaller was able to trigger a splat in the legacy netrom code base, where the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc() and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects a non-NULL object. There are a few other candidates aside from netrom which have similar pattern where in their accept-like implementation, they just call to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the corresponding cgroup_sk_clone() which then inherits the cgroup from the parent socket. None of them are related to core protocols where BPF cgroup programs are running from. However, in future, they should follow to implement a similar inheritance mechanism. Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID configuration, the same issue was exposed also prior to 8520e224f547 due to commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup") which added the early in_interrupt() return back then. Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode") Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup") Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net
2021-09-28bpf: Exempt CAP_BPF from checks against bpf_jit_limitLorenz Bauer
When introducing CAP_BPF, bpf_jit_charge_modmem() was not changed to treat programs with CAP_BPF as privileged for the purpose of JIT memory allocation. This means that a program without CAP_BPF can block a program with CAP_BPF from loading a program. Fix this by checking bpf_capable() in bpf_jit_charge_modmem(). Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210922111153.19843-1-lmb@cloudflare.com
2021-09-27mm/memcg: Convert mem_cgroup_charge() to take a folioMatthew Wilcox (Oracle)
Convert all callers of mem_cgroup_charge() to call page_folio() on the page they're currently passing in. Many of them will be converted to use folios themselves soon. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-26bpf: Support <8-byte scalar spill and refillMartin KaFai Lau
The verifier currently does not save the reg state when spilling <8byte bounded scalar to the stack. The bpf program will be incorrectly rejected when this scalar is refilled to the reg and then used to offset into a packet header. The later patch has a simplified bpf prog from a real use case to demonstrate this case. The current work around is to reparse the packet again such that this offset scalar is close to where the packet data will be accessed to avoid the spill. Thus, the header is parsed twice. The llvm patch [1] will align the <8bytes spill to the 8-byte stack address. This can simplify the verifier support by avoiding to store multiple reg states for each 8 byte stack slot. This patch changes the verifier to save the reg state when spilling <8bytes scalar to the stack. This reg state saving is limited to spill aligned to the 8-byte stack address. The current refill logic has already called coerce_reg_to_size(), so coerce_reg_to_size() is not called on state->stack[spi].spilled_ptr during spill. When refilling in check_stack_read_fixed_off(), it checks the refill size is the same as the number of bytes marked with STACK_SPILL before restoring the reg state. When restoring the reg state to state->regs[dst_regno], it needs to avoid the state->regs[dst_regno].subreg_def being over written because it has been marked by the check_reg_arg() earlier [check_mem_access() is called after check_reg_arg() in do_check()]. Reordering check_mem_access() and check_reg_arg() will need a lot of changes in test_verifier's tests because of the difference in verifier's error message. Thus, the patch here is to save the state->regs[dst_regno].subreg_def first in check_stack_read_fixed_off(). There are cases that the verifier needs to scrub the spilled slot from STACK_SPILL to STACK_MISC. After this patch the spill is not always in 8 bytes now, so it can no longer assume the other 7 bytes are always marked as STACK_SPILL. In particular, the scrub needs to avoid marking an uninitialized byte from STACK_INVALID to STACK_MISC. Otherwise, the verifier will incorrectly accept bpf program reading uninitialized bytes from the stack. A new helper scrub_spilled_slot() is created for this purpose. [1]: https://reviews.llvm.org/D109073 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210922004941.625398-1-kafai@fb.com
2021-09-26bpf: Check the other end of slot_type for STACK_SPILLMartin KaFai Lau
Every 8 bytes of the stack is tracked by a bpf_stack_state. Within each bpf_stack_state, there is a 'u8 slot_type[8]' to track the type of each byte. Verifier tests slot_type[0] == STACK_SPILL to decide if the spilled reg state is saved. Verifier currently only saves the reg state if the whole 8 bytes are spilled to the stack, so checking the slot_type[7] is the same as checking slot_type[0]. The later patch will allow verifier to save the bounded scalar reg also for <8 bytes spill. There is a llvm patch [1] to ensure the <8 bytes spill will be 8-byte aligned, so checking slot_type[7] instead of slot_type[0] is required. While at it, this patch refactors the slot_type[0] == STACK_SPILL test into a new function is_spilled_reg() and change the slot_type[0] check to slot_type[7] check in there also. [1] https://reviews.llvm.org/D109073 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210922004934.624194-1-kafai@fb.com
2021-09-26Merge tag 'timers-urgent-2021-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for the recently introduced regression in posix CPU timers which failed to stop the timer when requested. That caused unexpected signals to be sent to the process/thread causing malfunction" * tag 'timers-urgent-2021-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Prevent spuriously armed 0-value itimer
2021-09-26Merge tag 'irq-urgent-2021-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of fixes for interrupt chip drivers: - Work around a bad GIC integration on a Renesas platform which can't handle byte-sized MMIO access - Plug a potential memory leak in the GICv4 driver - Fix a regression in the Armada 370-XP IPI code which was caused by issuing EOI instack of ACK. - A couple of small fixes here and there" * tag 'irq-urgent-2021-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic: Work around broken Renesas integration irqchip/renesas-rza1: Use semicolons instead of commas irqchip/gic-v3-its: Fix potential VPE leak on error irqchip/goldfish-pic: Select GENERIC_IRQ_CHIP to fix build irqchip/mbigen: Repair non-kernel-doc notation irqdomain: Change the type of 'size' in __irq_domain_add() to be consistent irqchip/armada-370-xp: Fix ack/eoi breakage Documentation: Fix irq-domain.rst build warning