summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-10-09bpf: Track spill/fill of bounded scalars.Yonghong Song
Under register pressure the llvm may spill registers with bounds into the stack. The verifier has to track them through spill/fill otherwise many kinds of bound errors will be seen. The spill/fill of induction variables was already happening. This patch extends this logic from tracking spill/fill of a constant into any bounded register. There is no need to track spill/fill of unbounded, since no new information will be retrieved from the stack during register fill. Though extra stack difference could cause state pruning to be less effective, no adverse affects were seen from this patch on selftests and on cilium programs. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20201009011240.48506-3-alexei.starovoitov@gmail.com
2020-10-09bpf: Propagate scalar ranges through register assignments.Alexei Starovoitov
The llvm register allocator may use two different registers representing the same virtual register. In such case the following pattern can be observed: 1047: (bf) r9 = r6 1048: (a5) if r6 < 0x1000 goto pc+1 1050: ... 1051: (a5) if r9 < 0x2 goto pc+66 1052: ... 1053: (bf) r2 = r9 /* r2 needs to have upper and lower bounds */ This is normal behavior of greedy register allocator. The slides 137+ explain why regalloc introduces such register copy: http://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf There is no way to tell llvm 'not to do this'. Hence the verifier has to recognize such patterns. In order to track this information without backtracking allocate ID for scalars in a similar way as it's done for find_good_pkt_pointers(). When the verifier encounters r9 = r6 assignment it will assign the same ID to both registers. Later if either register range is narrowed via conditional jump propagate the register state into the other register. Clear register ID in adjust_reg_min_max_vals() for any alu instruction. The register ID is ignored for scalars in regsafe() and doesn't affect state pruning. mark_reg_unknown() clears the ID. It's used to process call, endian and other instructions. Hence ID is explicitly cleared only in adjust_reg_min_max_vals() and in 32-bit mov. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20201009011240.48506-2-alexei.starovoitov@gmail.com
2020-10-09Merge branch 'kcsan' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core Pull KCSAN updates for v5.10 from Paul E. McKenney: - Improve kernel messages. - Be more permissive with bitops races under KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y. - Optimize debugfs stat counters. - Introduce the instrument_*read_write() annotations, to provide a finer description of certain ops - using KCSAN's compound instrumentation. Use them for atomic RNW and bitops, where appropriate. Doing this might find new races. (Depends on the compiler having tsan-compound-read-before-write=1 support.) - Support atomic built-ins, which will help certain architectures, such as s390. - Misc enhancements and smaller fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-10-09Merge branch 'locking/urgent' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-10-09lockdep: Fix lockdep recursionPeter Zijlstra
Steve reported that lockdep_assert*irq*(), when nested inside lockdep itself, will trigger a false-positive. One example is the stack-trace code, as called from inside lockdep, triggering tracing, which in turn calls RCU, which then uses lockdep_assert_irqs_disabled(). Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-10-09lockdep: Fix usage_traceoverflowPeter Zijlstra
Basically print_lock_class_header()'s for loop is out of sync with the the size of of ->usage_traces[]. Also clean things up a bit while at it, to avoid such mishaps in the future. Fixes: 23870f122768 ("locking/lockdep: Fix "USED" <- "IN-NMI" inversions") Reported-by: Qian Cai <cai@redhat.com> Debugged-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Qian Cai <cai@redhat.com> Link: https://lkml.kernel.org/r/20200930094937.GE2651@hirez.programming.kicks-ass.net
2020-10-09Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull v5.10 RCU changes from Paul E. McKenney: - Debugging for smp_call_function(). - Strict grace periods for KASAN. The point of this series is to find RCU-usage bugs, so the corresponding new RCU_STRICT_GRACE_PERIOD Kconfig option depends on both DEBUG_KERNEL and RCU_EXPERT, and is further disabled by dfefault. Finally, the help text includes a goodly list of scary caveats. - New smp_call_function() torture test. - Torture-test updates. - Documentation updates. - Miscellaneous fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-10-09perf: Fix task_function_call() error handlingKajol Jain
The error handling introduced by commit: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") looses any return value from smp_call_function_single() that is not {0, -EINVAL}. This is a problem because it will return -EXNIO when the target CPU is offline. Worse, in that case it'll turn into an infinite loop. Fixes: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Barret Rhoden <brho@google.com> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/20200827064732.20860-1-kjain@linux.ibm.com
2020-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Small conflict around locking in rxrpc_process_event() - channel_lock moved to bundle in next, while state lock needs _bh() from net. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from Jakub Kicinski: "One more set of fixes from the networking tree: - add missing input validation in nl80211_del_key(), preventing out-of-bounds access - last minute fix / improvement of a MRP netlink (uAPI) interface introduced in 5.9 (current) release - fix "unresolved symbol" build error under CONFIG_NET w/o CONFIG_INET due to missing tcp_timewait_sock and inet_timewait_sock BTF. - fix 32 bit sub-register bounds tracking in the bpf verifier for OR case - tcp: fix receive window update in tcp_add_backlog() - openvswitch: handle DNAT tuple collision in conntrack-related code - r8169: wait for potential PHY reset to finish after applying a FW file, avoiding unexpected PHY behaviour and failures later on - mscc: fix tail dropping watermarks for Ocelot switches - avoid use-after-free in macsec code after a call to the GRO layer - avoid use-after-free in sctp error paths - add a device id for Cellient MPL200 WWAN card - rxrpc fixes: - fix the xdr encoding of the contents read from an rxrpc key - fix a BUG() for a unsupported encoding type. - fix missing _bh lock annotations. - fix acceptance handling for an incoming call where the incoming call is encrypted. - the server token keyring isn't network namespaced - it belongs to the server, so there's no need. Namespacing it means that request_key() fails to find it. - fix a leak of the server keyring" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (21 commits) net: usb: qmi_wwan: add Cellient MPL200 card macsec: avoid use-after-free in macsec_handle_frame() r8169: consider that PHY reset may still be in progress after applying firmware openvswitch: handle DNAT tuple collision sctp: fix sctp_auth_init_hmacs() error path bridge: Netlink interface fix. net: wireless: nl80211: fix out-of-bounds access in nl80211_del_key() bpf: Fix scalar32_min_max_or bounds tracking tcp: fix receive window update in tcp_add_backlog() net: usb: rtl8150: set random MAC address when set_ethernet_addr() fails mptcp: more DATA FIN fixes net: mscc: ocelot: warn when encoding an out-of-bounds watermark value net: mscc: ocelot: divide watermark value by 60 when writing to SYS_ATOP net: qrtr: ns: Fix the incorrect usage of rcu_read_lock() rxrpc: Fix server keyring leak rxrpc: The server keyring isn't network-namespaced rxrpc: Fix accept on a connection that need securing rxrpc: Fix some missing _bh annotations on locking conn->state_lock rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read() rxrpc: Fix rxkad token xdr encoding ...
2020-10-08seccomp: Make duplicate listener detection non-racyJann Horn
Currently, init_listener() tries to prevent adding a filter with SECCOMP_FILTER_FLAG_NEW_LISTENER if one of the existing filters already has a listener. However, this check happens without holding any lock that would prevent another thread from concurrently installing a new filter (potentially with a listener) on top of the ones we already have. Theoretically, this is also a data race: The plain load from current->seccomp.filter can race with concurrent writes to the same location. Fix it by moving the check into the region that holds the siglock to guard against concurrent TSYNC. (The "Fixes" tag points to the commit that introduced the theoretical data race; concurrent installation of another filter with TSYNC only became possible later, in commit 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together").) Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Reviewed-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201005014401.490175-1-jannh@google.com
2020-10-08tracing: Fix synthetic print fmt check for use of __get_str()Steven Rostedt (VMware)
A cut and paste error had the check to use __get_str() test "is_dynamic" twice, instead of checking "is_string && is_dynamic". Link: https://lore.kernel.org/r/d34dccd5-96ba-a2d9-46ea-de8807525deb@canonical.com Reported-by: Colin Ian King <colin.king@canonical.com> Acked-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08tracing: Remove a pointless assignmentSudip Mukherjee
The variable 'len' has been assigned a value but is not used after that. So, remove the assignement. Link: https://lkml.kernel.org/r/20200930184303.22896-1-sudipm.mukherjee@gmail.com Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08ftrace: Format variable declarations of ftrace_allocate_recordsSteven Rostedt (VMware)
I hate when unrelated variables are declared on the same line. Split them. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08ftrace: Simplify the calculation of page number for ftrace_page->recordsWei Yang
Based on the following two reasones, we could simplify the calculation: - If the number after roundup count is not power of 2, we would definitely have more than 1 empty page with a higher order. - get_count_order() just return current order, so one lower order could meet the requirement. The calculation could be simplified by lower one order level when pages are not power of 2. Link: https://lkml.kernel.org/r/20200831031104.23322-5-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08ftrace: Simplify the hash calculationSteven Rostedt (VMware)
No need to add a check to subtract the number of bits if bits is zero after fls(). Just divide the size by two before calling it. This does give the same answer for size of 0 and 1, but that's fine. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08ftrace: Use fls() to get the bits for dup_hash()Wei Yang
The effect here is to get the number of bits, lets use fls() to do this job. Link: https://lkml.kernel.org/r/20200831031104.23322-3-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08tracing: Change synthetic event string format to limit printed lengthSteven Rostedt (VMware)
Change the format for printing synthetic field strings to limit the length of the string printed even if it's not correctly terminated. Link: https://lore.kernel.org/r/20201002210036.0200371b@oasis.local.home Link: https://lkml.kernel.org/r/b6bdb34e70d970e8026daa3503db6b8e5cdad524.1601848695.git.zanussi@kernel.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08tracing: Add README information for synthetic_events fileTom Zanussi
Add an entry with a basic description of events/synthetic_events along with a simple example. Link: https://lkml.kernel.org/r/3c7f178cf95aaeebc01eda7d95600dd937233eb7.1601848695.git.zanussi@kernel.org Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2020-10-08 The main changes are: 1) Fix "unresolved symbol" build error under CONFIG_NET w/o CONFIG_INET due to missing tcp_timewait_sock and inet_timewait_sock BTF, from Yonghong Song. 2) Fix 32 bit sub-register bounds tracking for OR case, from Daniel Borkmann. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08bpf: Fix scalar32_min_max_or bounds trackingDaniel Borkmann
Simon reported an issue with the current scalar32_min_max_or() implementation. That is, compared to the other 32 bit subreg tracking functions, the code in scalar32_min_max_or() stands out that it's using the 64 bit registers instead of 32 bit ones. This leads to bounds tracking issues, for example: [...] 8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm 8: (79) r1 = *(u64 *)(r0 +0) R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm 9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm 9: (b7) r0 = 1 10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm 10: (18) r2 = 0x600000002 12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 12: (ad) if r1 < r2 goto pc+1 R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 13: (95) exit 14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 14: (25) if r1 > 0x0 goto pc+1 R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 15: (95) exit 16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 16: (47) r1 |= 0 17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x1; 0x700000000),s32_max_value=1,u32_max_value=1) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm [...] The bound tests on the map value force the upper unsigned bound to be 25769803777 in 64 bit (0b11000000000000000000000000000000001) and then lower one to be 1. By using OR they are truncated and thus result in the range [1,1] for the 32 bit reg tracker. This is incorrect given the only thing we know is that the value must be positive and thus 2147483647 (0b1111111111111111111111111111111) at max for the subregs. Fix it by using the {u,s}32_{min,max}_value vars instead. This also makes sense, for example, for the case where we update dst_reg->s32_{min,max}_value in the else branch we need to use the newly computed dst_reg->u32_{min,max}_value as we know that these are positive. Previously, in the else branch the 64 bit values of umin_value=1 and umax_value=32212254719 were used and latter got truncated to be 1 as upper bound there. After the fix the subreg range is now correct: [...] 8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm 8: (79) r1 = *(u64 *)(r0 +0) R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm 9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm 9: (b7) r0 = 1 10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm 10: (18) r2 = 0x600000002 12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 12: (ad) if r1 < r2 goto pc+1 R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 13: (95) exit 14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 14: (25) if r1 > 0x0 goto pc+1 R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 15: (95) exit 16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm 16: (47) r1 |= 0 17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm [...] Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Simon Scannell <scannell.smn@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2020-10-07bpf: Fix build failure for kernel/trace/bpf_trace.c with CONFIG_NET=nYonghong Song
When CONFIG_NET is not defined, I hit the following build error: kernel/trace/bpf_trace.o:(.rodata+0x110): undefined reference to `bpf_prog_test_run_raw_tp' Commit 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") added test_run support for raw_tracepoint in /kernel/trace/bpf_trace.c. But the test_run function bpf_prog_test_run_raw_tp is defined in net/bpf/test_run.c, only available with CONFIG_NET=y. Adding a CONFIG_NET guard for .test_run = bpf_prog_test_run_raw_tp; fixed the above build issue. Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201007062933.3425899-1-yhs@fb.com
2020-10-07kernel/bpf/verifier: Fix build when NET is not enabledRandy Dunlap
Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is not enabled. ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’? .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON], ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’? .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON], Fixes: 1df8f55a37bd ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org
2020-10-07cpufreq: schedutil: Simplify sugov_fast_switch()Rafael J. Wysocki
Drop a redundant local variable definition from sugov_fast_switch() and rearrange the code in there to avoid the redundant logical negation. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-10-07dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handlingChristoph Hellwig
Use and entirely separate code path for the DMA_ATTR_NO_KERNEL_MAPPING path. This avoids any confusion about the ret type, and avoids lots of attr checks and helpers that can be significantly simplified now. It also ensures that common handling is applied to architetures still using the arch alloc/free hooks. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-07dma-direct: factor out a dma_direct_alloc_from_pool helperChristoph Hellwig
This ensures dma_direct_alloc_pages will use the right gfp mask, as well as keeping the code for that common between the two allocators. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-07dma-direct check for highmem pages in dma_direct_alloc_pagesChristoph Hellwig
Check for highmem pages from CMA, just like in the dma_direct_alloc path. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06usermodehelper: reset umask to default before executing user processLinus Torvalds
Kernel threads intentionally do CLONE_FS in order to follow any changes that 'init' does to set up the root directory (or cwd). It is admittedly a bit odd, but it avoids the situation where 'init' does some extensive setup to initialize the system environment, and then we execute a usermode helper program, and it uses the original FS setup from boot time that may be very limited and incomplete. [ Both Al Viro and Eric Biederman point out that 'pivot_root()' will follow the root regardless, since it fixes up other users of root (see chroot_fs_refs() for details), but overmounting root and doing a chroot() would not. ] However, Vegard Nossum noticed that the CLONE_FS not only means that we follow the root and current working directories, it also means we share umask with whatever init changed it to. That wasn't intentional. Just reset umask to the original default (0022) before actually starting the usermode helper program. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-06Merge branch 'irq/qcom-pdc-wakeup' into irq/irqchip-nextMarc Zyngier
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-10-06genirq/PM: Introduce IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND flagMaulik Shah
An interrupt that is disabled/masked but set for wakeup may still need to be able to wake up the system from sleep states like "suspend to RAM". To that effect, introduce the IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND flag. If the irqchip have this flag set, the irq PM code will enable/unmask the irqs that are marked for wakeup, but that are in a disabled state. On resume, such irqs will be restored back to their disabled state. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Maulik Shah <mkshah@codeaurora.org> [maz: commit message fix-up] Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/1601267524-20199-4-git-send-email-mkshah@codeaurora.org
2020-10-06dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>Christoph Hellwig
Move more nitty gritty DMA implementation details into the common internal header. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dmaChristoph Hellwig
Most of the dma_direct symbols should only be used by direct.c and mapping.c, so move them to kernel/dma. In fact more of dma-direct.h should eventually move, but that will require more coordination with other subsystems. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: move dma-debug.h to kernel/dma/Christoph Hellwig
Most of dma-debug.h is not required by anything outside of kernel/dma. Move the four declarations needed by dma-mappin.h or dma-ops providers into dma-mapping.h and dma-map-ops.h, and move the remainder of the file to kernel/dma/debug.h. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: remove <asm/dma-contiguous.h>Christoph Hellwig
Just provide a weak default definition of dma_contiguous_early_fixup and let arm override it. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>Christoph Hellwig
Merge dma-contiguous.h into dma-map-ops.h, after removing the comment describing the contiguous allocator into kernel/dma/contigous.c. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-contiguous: remove dma_contiguous_set_defaultChristoph Hellwig
dma_contiguous_set_default contains a trivial assignment, and has a single caller that is compiled if CONFIG_CMA_DMA is enabled. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-contiguous: remove dev_set_cma_areaChristoph Hellwig
dev_set_cma_area contains a trivial assignment. It has just three callers that all have a non-NULL device and depend on CONFIG_DMA_CMA, so remove the wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: split <linux/dma-mapping.h>Christoph Hellwig
Split out all the bits that are purely for dma_map_ops implementations and related code into a new <linux/dma-map-ops.h> header so that they don't get pulled into all the drivers. That also means the architecture specific <asm/dma-mapping.h> is not pulled in by <linux/dma-mapping.h> any more, which leads to a missing includes that were pulled in by the x86 or arm versions in a few not overly portable drivers. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06cma: decrease CMA_ALIGNMENT lower limit to 2Paul Cercueil
On an embedded system with a tiny (1 MiB) CMA area for video memory, and a simple enough video pipeline, we can decrease the CMA_ALIGNMENT by a factor of 2 to avoid wasting memory, as all the allocations for video buffers will be of the exact same size (dictated by the size of the screen). Signed-off-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Rejecting non-native endian BTF overlapped with the addition of support for it. The rest were more simple overlapping changes, except the renesas ravb binding update, which had to follow a file move as well as a YAML conversion. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05tracing: Add support for dynamic strings to synthetic eventsTom Zanussi
Currently, sythetic events only support static string fields such as: # echo 'test_latency u64 lat; char somename[32]' > /sys/kernel/debug/tracing/synthetic_events Which is fine, but wastes a lot of space in the event. It also prevents the most commonly-defined strings in the existing trace events e.g. those defined using __string(), from being passed to synthetic events via the trace() action. With this change, synthetic events with dynamic fields can be defined: # echo 'test_latency u64 lat; char somename[]' > /sys/kernel/debug/tracing/synthetic_events And the trace() action can be used to generate events using either dynamic or static strings: # echo 'hist:keys=name:lat=common_timestamp.usecs-$ts0:onmatch(sys.event).test_latency($lat,name)' > /sys/kernel/debug/tracing/events The synthetic event dynamic strings are implemented in the same way as the existing __data_loc strings and appear as such in the format file. [ <rostedt@goodmis.org>: added __set_synth_event_print_fmt() changes: I added the following to make it work with trace-cmd. Dynamic strings must have __get_str() for events in the print_fmt otherwise it can't be parsed correctly. ] Link: https://lore.kernel.org/r/cover.1601588066.git.zanussi@kernel.org Link: https://lkml.kernel.org/r/3ed35b6d0e390f5b94cb4a9ba1cc18f5982ab277.1601848695.git.zanussi@kernel.org Tested-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-05tracing: Save normal string variablesTom Zanussi
String variables created as field variables and save variables are already handled properly by having their values copied when set. The same isn't done for normal variables, but needs to be - simply saving a pointer to a string contained in an old event isn't sufficient, since that event's data may quickly become overwritten and therefore a string pointer to it could yield garbage. This change uses the same mechanism as field variables and simply appends the new strings to the existing per-element field_var_str[] array allocated for that purpose. Link: https://lkml.kernel.org/r/1c1a03798b02e67307412a0c719d1bfb69b13007.1601848695.git.zanussi@kernel.org Fixes: 02205a6752f2 (tracing: Add support for 'field variables') Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-05tracing: Fix parse_synth_field() error handlingTom Zanussi
synth_field_size() returns either a positive size or an error (zero or a negative value). However, the existing code assumes the only error value is 0. It doesn't handle negative error codes, as it assigns directly to field->size (a size_t; unsigned), thereby interpreting the error code as a valid size instead. Do the test before assignment to field->size. [ axelrasmussen@google.com: changelog addition, first paragraph above ] Link: https://lkml.kernel.org/r/9b6946d9776b2eeb43227678158196de1c3c6e1d.1601848695.git.zanussi@kernel.org Fixes: 4b147936fa50 (tracing: Add support for 'synthetic' events) Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-05tracing: Change STR_VAR_MAX_LENTom Zanussi
32 is too small for this value, and anyway it makes more sense to use MAX_FILTER_STR_VAL, as this is also the value used for variable-length __strings. Link: https://lkml.kernel.org/r/6adfd1668ac1fd8670bd58206944a762061a5559.1601848695.git.zanussi@kernel.org Tested-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-06bpf: Use raw_spin_trylock() for pcpu_freelist_push/pop in NMISong Liu
Recent improvements in LOCKDEP highlighted a potential A-A deadlock with pcpu_freelist in NMI: ./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi [ 18.984807] ================================ [ 18.984807] WARNING: inconsistent lock state [ 18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted [ 18.984809] -------------------------------- [ 18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage. [ 18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes: [ 18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180 [ 18.984813] {INITIAL USE} state was registered at: [ 18.984814] lock_acquire+0x175/0x7c0 [ 18.984814] _raw_spin_lock+0x2c/0x40 [ 18.984815] __pcpu_freelist_pop+0xe3/0x180 [ 18.984815] pcpu_freelist_pop+0x31/0x40 [ 18.984816] htab_map_alloc+0xbbf/0xf40 [ 18.984816] __do_sys_bpf+0x5aa/0x3ed0 [ 18.984817] do_syscall_64+0x2d/0x40 [ 18.984818] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 18.984818] irq event stamp: 12 [...] [ 18.984822] other info that might help us debug this: [ 18.984823] Possible unsafe locking scenario: [ 18.984823] [ 18.984824] CPU0 [ 18.984824] ---- [ 18.984824] lock(&head->lock); [ 18.984826] <Interrupt> [ 18.984826] lock(&head->lock); [ 18.984827] [ 18.984828] *** DEADLOCK *** [ 18.984828] [ 18.984829] 2 locks held by test_progs/1990: [...] [ 18.984838] <NMI> [ 18.984838] dump_stack+0x9a/0xd0 [ 18.984839] lock_acquire+0x5c9/0x7c0 [ 18.984839] ? lock_release+0x6f0/0x6f0 [ 18.984840] ? __pcpu_freelist_pop+0xe3/0x180 [ 18.984840] _raw_spin_lock+0x2c/0x40 [ 18.984841] ? __pcpu_freelist_pop+0xe3/0x180 [ 18.984841] __pcpu_freelist_pop+0xe3/0x180 [ 18.984842] pcpu_freelist_pop+0x17/0x40 [ 18.984842] ? lock_release+0x6f0/0x6f0 [ 18.984843] __bpf_get_stackid+0x534/0xaf0 [ 18.984843] bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350 [ 18.984844] bpf_overflow_handler+0x12f/0x3f0 This is because pcpu_freelist_head.lock is accessed in both NMI and non-NMI context. Fix this issue by using raw_spin_trylock() in NMI. Since NMI interrupts non-NMI context, when NMI context tries to lock the raw_spinlock, non-NMI context of the same CPU may already have locked a lock and is blocked from unlocking the lock. For a system with N CPUs, there could be N NMIs at the same time, and they may block N non-NMI raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike _pop(), failing _push() means leaking memory. This issue is more likely to trigger in non-SMP system. Fix this issue with an extra list, pcpu_freelist.extralist. The extralist is primarily used to take _push() when raw_spin_trylock() failed on all the per CPU lists. It should be empty most of the time. The following table summarizes the behavior of pcpu_freelist in NMI and non-NMI: non-NMI pop(): use _lock(); check per CPU lists first; if all per CPU lists are empty, check extralist; if extralist is empty, return NULL. non-NMI push(): use _lock(); only push to per CPU lists. NMI pop(): use _trylock(); check per CPU lists first; if all per CPU lists are locked or empty, check extralist; if extralist is locked or empty, return NULL. NMI push(): use _trylock(); check per CPU lists first; if all per CPU lists are locked; try push to extralist; if extralist is also locked, keep trying on per CPU lists. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com
2020-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Make sure SKB control block is in the proper state during IPSEC ESP-in-TCP encapsulation. From Sabrina Dubroca. 2) Various kinds of attributes were not being cloned properly when we build new xfrm_state objects from existing ones. Fix from Antony Antony. 3) Make sure to keep BTF sections, from Tony Ambardar. 4) TX DMA channels need proper locking in lantiq driver, from Hauke Mehrtens. 5) Honour route MTU during forwarding, always. From Maciej Żenczykowski. 6) Fix races in kTLS which can result in crashes, from Rohit Maheshwari. 7) Skip TCP DSACKs with rediculous sequence ranges, from Priyaranjan Jha. 8) Use correct address family in xfrm state lookups, from Herbert Xu. 9) A bridge FDB flush should not clear out user managed fdb entries with the ext_learn flag set, from Nikolay Aleksandrov. 10) Fix nested locking of netdev address lists, from Taehee Yoo. 11) Fix handling of 32-bit DATA_FIN values in mptcp, from Mat Martineau. 12) Fix r8169 data corruptions on RTL8402 chips, from Heiner Kallweit. 13) Don't free command entries in mlx5 while comp handler could still be running, from Eran Ben Elisha. 14) Error flow of request_irq() in mlx5 is busted, due to an off by one we try to free and IRQ never allocated. From Maor Gottlieb. 15) Fix leak when dumping netlink policies, from Johannes Berg. 16) Sendpage cannot be performed when a page is a slab page, or the page count is < 1. Some subsystems such as nvme were doing so. Create a "sendpage_ok()" helper and use it as needed, from Coly Li. 17) Don't leak request socket when using syncookes with mptcp, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits) net/core: check length before updating Ethertype in skb_mpls_{push,pop} net: mvneta: fix double free of txq->buf net_sched: check error pointer in tcf_dump_walker() net: team: fix memory leak in __team_options_register net: typhoon: Fix a typo Typoon --> Typhoon net: hinic: fix DEVLINK build errors net: stmmac: Modify configuration method of EEE timers tcp: fix syn cookied MPTCP request socket leak libceph: use sendpage_ok() in ceph_tcp_sendpage() scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map() drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage() tcp: use sendpage_ok() to detect misused .sendpage nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send net: introduce helper sendpage_ok() in include/linux/net.h net: usb: pegasus: Proper error handing when setting pegasus' MAC address net: core: document two new elements of struct net_device netlink: fix policy dump leak net/mlx5e: Fix race condition on nhe->n pointer in neigh update net/mlx5e: Fix VLAN create flow ...
2020-10-05PM: hibernate: remove the bogus call to get_gendisk() in software_resume()Christoph Hellwig
get_gendisk grabs a reference on the disk and file operation, so this code will leak both of them while having absolutely no use for the gendisk itself. This effectively reverts commit 2df83fa4bce421f ("PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-05block: add a bdget_part helperChristoph Hellwig
All remaining callers of bdget() outside of fs/block_dev.c want to get a reference to the struct block_device for a given struct hd_struct. Add a helper just for that and then mark bdget static. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05printk: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace /* FALL THRU */ comment with the new pseudo-keyword macro fallthrough[1]. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20201002224627.GA30475@embeddedor
2020-10-05bpf, verifier: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace /* fallthrough */ comments with the new pseudo-keyword macro fallthrough [1]. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201002234217.GA12280@embeddedor