summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-06-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Don't insert ESP trailer twice in IPSEC code, from Huy Nguyen. 2) The default crypto algorithm selection in Kconfig for IPSEC is out of touch with modern reality, fix this up. From Eric Biggers. 3) bpftool is missing an entry for BPF_MAP_TYPE_RINGBUF, from Andrii Nakryiko. 4) Missing init of ->frame_sz in xdp_convert_zc_to_xdp_frame(), from Hangbin Liu. 5) Adjust packet alignment handling in ax88179_178a driver to match what the hardware actually does. From Jeremy Kerr. 6) register_netdevice can leak in the case one of the notifiers fail, from Yang Yingliang. 7) Use after free in ip_tunnel_lookup(), from Taehee Yoo. 8) VLAN checks in sja1105 DSA driver need adjustments, from Vladimir Oltean. 9) tg3 driver can sleep forever when we get enough EEH errors, fix from David Christensen. 10) Missing {READ,WRITE}_ONCE() annotations in various Intel ethernet drivers, from Ciara Loftus. 11) Fix scanning loop break condition in of_mdiobus_register(), from Florian Fainelli. 12) MTU limit is incorrect in ibmveth driver, from Thomas Falcon. 13) Endianness fix in mlxsw, from Ido Schimmel. 14) Use after free in smsc95xx usbnet driver, from Tuomas Tynkkynen. 15) Missing bridge mrp configuration validation, from Horatiu Vultur. 16) Fix circular netns references in wireguard, from Jason A. Donenfeld. 17) PTP initialization on recovery is not done properly in qed driver, from Alexander Lobakin. 18) Endian conversion of L4 ports in filters of cxgb4 driver is wrong, from Rahul Lakkireddy. 19) Don't clear bound device TX queue of socket prematurely otherwise we get problems with ktls hw offloading, from Tariq Toukan. 20) ipset can do atomics on unaligned memory, fix from Russell King. 21) Align ethernet addresses properly in bridging code, from Thomas Martitz. 22) Don't advertise ipv4 addresses on SCTP sockets having ipv6only set, from Marcelo Ricardo Leitner. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (149 commits) rds: transport module should be auto loaded when transport is set sch_cake: fix a few style nits sch_cake: don't call diffserv parsing code when it is not needed sch_cake: don't try to reallocate or unshare skb unconditionally ethtool: fix error handling in linkstate_prepare_data() wil6210: account for napi_gro_receive never returning GRO_DROP hns: do not cast return value of napi_gro_receive to null socionext: account for napi_gro_receive never returning GRO_DROP wireguard: receive: account for napi_gro_receive never returning GRO_DROP vxlan: fix last fdb index during dump of fdb with nhid sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket tc-testing: avoid action cookies with odd length. bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT net: dsa: sja1105: fix tc-gate schedule with single element net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules net: dsa: sja1105: unconditionally free old gating config net: dsa: sja1105: move sja1105_compose_gating_subschedule at the top net: macb: free resources on failure path of at91ether_open() net: macb: call pm_runtime_put_sync on failure path ...
2020-06-25Merge tag 'trace-v5.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Four small fixes: - Fix a ringbuffer bug for nested events having time go backwards - Fix a config dependency for boot time tracing to depend on synthetic events instead of histograms. - Fix trigger format parsing to handle multiple spaces - Fix bootconfig to handle failures in multiple events" * tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/boottime: Fix kprobe multiple events tracing: Fix event trigger to accept redundant spaces tracing/boot: Fix config dependency for synthedic event ring-buffer: Zero out time extend if it is nested and not absolute
2020-06-26fork: annotate data race in copy_process()Weilong Chen
KCSAN reported data race reading and writing nr_threads and max_threads. The data race is intentional and benign. This is obvious from the comment above it and based on general consensus when discussing this issue. So there's no need for any heavy atomic or *_ONCE() machinery here. In accordance with the newly introduced data_race() annotation consensus, mark the offending line with data_race(). Here it's actually useful not just to silence KCSAN but to also clearly communicate that the race is intentional. This is especially helpful since nr_threads is otherwise protected by tasklist_lock. BUG: KCSAN: data-race in copy_process / copy_process write to 0xffffffff86205cf8 of 4 bytes by task 14779 on cpu 1: copy_process+0x2eba/0x3c40 kernel/fork.c:2273 _do_fork+0xfe/0x7a0 kernel/fork.c:2421 __do_sys_clone kernel/fork.c:2576 [inline] __se_sys_clone kernel/fork.c:2557 [inline] __x64_sys_clone+0x130/0x170 kernel/fork.c:2557 do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffffffff86205cf8 of 4 bytes by task 6944 on cpu 0: copy_process+0x94d/0x3c40 kernel/fork.c:1954 _do_fork+0xfe/0x7a0 kernel/fork.c:2421 __do_sys_clone kernel/fork.c:2576 [inline] __se_sys_clone kernel/fork.c:2557 [inline] __x64_sys_clone+0x130/0x170 kernel/fork.c:2557 do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: https://groups.google.com/forum/#!msg/syzkaller-upstream-mo deration/thvp7AHs5Ew/aPdYLXfYBQAJ Reported-by: syzbot+52fced2d288f8ecd2b20@syzkaller.appspotmail.com Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Weilong Chen <chenweilong@huawei.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Qian Cai <cai@lca.pw> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Marco Elver <elver@google.com> [christian.brauner@ubuntu.com: rewrite commit message] Link: https://lore.kernel.org/r/20200623041240.154294-1-chenweilong@huawei.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-25rcu: Fixup noinstr warningsPeter Zijlstra
A KCSAN build revealed we have explicit annoations through atomic_*() usage, switch to arch_atomic_*() for the respective functions. vmlinux.o: warning: objtool: rcu_nmi_exit()+0x4d: call to __kcsan_check_access() leaves .noinstr.text section vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x25: call to __kcsan_check_access() leaves .noinstr.text section vmlinux.o: warning: objtool: rcu_nmi_enter()+0x4f: call to __kcsan_check_access() leaves .noinstr.text section vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x2a: call to __kcsan_check_access() leaves .noinstr.text section vmlinux.o: warning: objtool: __rcu_is_watching()+0x25: call to __kcsan_check_access() leaves .noinstr.text section Additionally, without the NOP in instrumentation_begin(), objtool would not detect the lack of the 'else instrumentation_begin();' branch in rcu_nmi_enter(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-25bpf: Do not allow btf_ctx_access with __int128 typesJohn Fastabend
To ensure btf_ctx_access() is safe the verifier checks that the BTF arg type is an int, enum, or pointer. When the function does the BTF arg lookup it uses the calculation 'arg = off / 8' using the fact that registers are 8B. This requires that the first arg is in the first reg, the second in the second, and so on. However, for __int128 the arg will consume two registers by default LLVM implementation. So this will cause the arg layout assumed by the 'arg = off / 8' calculation to be incorrect. Because __int128 is uncommon this patch applies the easiest fix and will force int types to be sizeof(u64) or smaller so that they will fit in a single register. v2: remove unneeded parens per Andrii's feedback Fixes: 9e15db66136a1 ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/159303723962.11287.13309537171132420717.stgit@john-Precision-5820-Tower
2020-06-25console: Fix trivia typo 'change' -> 'chance'Andy Shevchenko
I bet the word 'chance' has to be used in 'had a chance to be called', but, alas, I'm not native speaker... Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200618164751.56828-7-andriy.shevchenko@linux.intel.com
2020-06-25console: Propagate error code from console ->setup()Andy Shevchenko
Since console ->setup() hook returns meaningful error codes, propagate it to the caller of try_enable_new_console(). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200618164751.56828-6-andriy.shevchenko@linux.intel.com
2020-06-25cpuidle: Rearrange s2idle-specific idle state entry codeRafael J. Wysocki
Implement call_cpuidle_s2idle() in analogy with call_cpuidle() for the s2idle-specific idle state entry and invoke it from cpuidle_idle_call() to make the s2idle-specific idle entry code path look more similar to the "regular" idle entry one. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Chen Yu <yu.c.chen@intel.com>
2020-06-25sched/fair: Optimize dequeue_task_fair()Peng Wang
While looking at enqueue_task_fair and dequeue_task_fair, it occurred to me that dequeue_task_fair can also be optimized as Vincent described in commit 7d148be69e3a ("sched/fair: Optimize enqueue_task_fair()"). When encountering throttled cfs_rq, dequeue_throttle label can ensure se not to be NULL, and rq->nr_running remains unchanged, so we can also skip the early balance check. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/701eef9a40de93dcf5fe7063fd607bca5db38e05.1592287263.git.rocking@linux.alibaba.com
2020-06-25sched: Micro optimization in pick_next_task() and in check_preempt_curr()Kirill Tkhai
This introduces an optimization based on xxx_sched_class addresses in two hot scheduler functions: pick_next_task() and check_preempt_curr(). It is possible to compare pointers to sched classes to check, which of them has a higher priority, instead of current iterations using for_each_class(). One more result of the patch is that size of object file becomes a little less (excluding added BUG_ON(), which goes in __init section): $size kernel/sched/core.o text data bss dec hex filename before: 66446 18957 676 86079 1503f kernel/sched/core.o after: 66398 18957 676 86031 1500f kernel/sched/core.o Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/711a9c4b-ff32-1136-b848-17c622d548f3@yandex.ru
2020-06-25sched: Remove struct sched_class::next fieldSteven Rostedt (VMware)
Now that the sched_class descriptors are defined in order via the linker script vmlinux.lds.h, there's no reason to have a "next" pointer to the previous priroity structure. The order of the sturctures can be aligned as an array, and used to index and find the next sched_class descriptor. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191219214558.845353593@goodmis.org
2020-06-25sched: Have sched_class_highest define by vmlinux.lds.hSteven Rostedt (VMware)
Now that the sched_class descriptors are defined by the linker script, and this needs to be aware of the existance of stop_sched_class when SMP is enabled or not, as it is used as the "highest" priority when defined. Move the declaration of sched_class_highest to the same location in the linker script that inserts stop_sched_class, and this will also make it easier to see what should be defined as the highest class, as this linker script location defines the priorities as well. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191219214558.682913590@goodmis.org
2020-06-25sched: Force the address order of each sched class descriptorSteven Rostedt (VMware)
In order to make a micro optimization in pick_next_task(), the order of the sched class descriptor address must be in the same order as their priority to each other. That is: &idle_sched_class < &fair_sched_class < &rt_sched_class < &dl_sched_class < &stop_sched_class In order to guarantee this order of the sched class descriptors, add each one into their own data section and force the order in the linker script. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/157675913272.349305.8936736338884044103.stgit@localhost.localdomain
2020-06-25kdb: Make kdb_printf() console handling more robustSumit Garg
While rounding up CPUs via NMIs, its possible that a rounded up CPU maybe holding a console port lock leading to kgdb master CPU stuck in a deadlock during invocation of console write operations. A similar deadlock could also be possible while using synchronous breakpoints. So in order to avoid such a deadlock, set oops_in_progress to encourage the console drivers to disregard their internal spin locks: in the current calling context the risk of deadlock is a bigger problem than risks due to re-entering the console driver. We operate directly on oops_in_progress rather than using bust_spinlocks() because the calls bust_spinlocks() makes on exit are not appropriate for this calling context. Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/1591264879-25920-4-git-send-email-sumit.garg@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25kdb: Check status of console prior to invoking handlersSumit Garg
Check if a console is enabled prior to invoking corresponding write handler. Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/1591264879-25920-3-git-send-email-sumit.garg@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25kdb: Re-factor kdb_printf() message write codeSumit Garg
Re-factor kdb_printf() message write code in order to avoid duplication of code and thereby increase readability. Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/1591264879-25920-2-git-send-email-sumit.garg@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-24bpf: Add bpf_skc_to_udp6_sock() helperYonghong Song
The helper is used in tracing programs to cast a socket pointer to a udp6_sock pointer. The return value could be NULL if the casting is illegal. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200623230815.3988481-1-yhs@fb.com
2020-06-24bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpersYonghong Song
Three more helpers are added to cast a sock_common pointer to an tcp_sock, tcp_timewait_sock or a tcp_request_sock for tracing programs. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230811.3988277-1-yhs@fb.com
2020-06-24bpf: Add bpf_skc_to_tcp6_sock() helperYonghong Song
The helper is used in tracing programs to cast a socket pointer to a tcp6_sock pointer. The return value could be NULL if the casting is illegal. A new helper return type RET_PTR_TO_BTF_ID_OR_NULL is added so the verifier is able to deduce proper return types for the helper. Different from the previous BTF_ID based helpers, the bpf_skc_to_tcp6_sock() argument can be several possible btf_ids. More specifically, all possible socket data structures with sock_common appearing in the first in the memory layout. This patch only added socket types related to tcp and udp. All possible argument btf_id and return value btf_id for helper bpf_skc_to_tcp6_sock() are pre-calculcated and cached. In the future, it is even possible to precompute these btf_id's at kernel build time. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230809.3988195-1-yhs@fb.com
2020-06-24bpf: Allow tracing programs to use bpf_jiffies64() helperYonghong Song
/proc/net/tcp{4,6} uses jiffies for various computations. Let us add bpf_jiffies64() helper to tracing program so bpf_iter and other programs can use it. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230808.3988073-1-yhs@fb.com
2020-06-24bpf: Support 'X' in bpf_seq_printf() helperYonghong Song
'X' tells kernel to print hex with upper case letters. /proc/net/tcp{4,6} seq_file show() used this, and supports it in bpf_seq_printf() helper too. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230807.3988014-1-yhs@fb.com
2020-06-24Merge tag 'for-linus-2020-06-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread fix from Christian Brauner: "This fixes a regression introduced with 303cc571d107 ("nsproxy: attach to namespaces via pidfds"). The LTP testsuite reported a regression where users would now see EBADF returned instead of EINVAL when an fd was passed that referred to an open file but the file was not a namespace file. Fix this by continuing to report EINVAL and add a regression test" * tag 'for-linus-2020-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: test for setns() EINVAL regression nsproxy: restore EINVAL for non-namespace file descriptor
2020-06-24PM / EM: change name of em_pd_energy to em_cpu_energyLukasz Luba
Energy Model framework now supports other devices than CPUs. Refactor some of the functions in order to prevent wrong usage. The old function em_pd_energy has to generic name. It must not be used without proper cpumask pointer, which is possible only for CPU devices. Thus, rename it and add proper description to warn of potential wrong usage for other devices. Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24PM / EM: remove em_register_perf_domainLukasz Luba
Remove old function em_register_perf_domain which is no longer needed. There is em_dev_register_perf_domain that covers old use cases and new as well. Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24PM / EM: add support for other devices than CPUs in Energy ModelLukasz Luba
Add support for other devices than CPUs. The registration function does not require a valid cpumask pointer and is ready to handle new devices. Some of the internal structures has been reorganized in order to keep consistent view (like removing per_cpu pd pointers). Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24block: create the request_queue debugfs_dir on registrationLuis Chamberlain
We were only creating the request_queue debugfs_dir only for make_request block drivers (multiqueue), but never for request-based block drivers. We did this as we were only creating non-blktrace additional debugfs files on that directory for make_request drivers. However, since blktrace *always* creates that directory anyway, we special-case the use of that directory on blktrace. Other than this being an eye-sore, this exposes request-based block drivers to the same debugfs fragile race that used to exist with make_request block drivers where if we start adding files onto that directory we can later run a race with a double removal of dentries on the directory if we don't deal with this carefully on blktrace. Instead, just simplify things by always creating the request_queue debugfs_dir on request_queue registration. Rename the mutex also to reflect the fact that this is used outside of the blktrace context. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24blktrace: ensure our debugfs dir existsLuis Chamberlain
We make an assumption that a debugfs directory exists, but since this can fail ensure it exists before allowing blktrace setup to complete. Otherwise we end up stuffing blktrace files on the debugfs root directory. In the worst case scenario this *in theory* can create an eventual panic *iff* in the future a similarly named file is created prior on the debugfs root directory. This theoretical crash can happen due to a recursive removal followed by a specific dentry removal. This doesn't fix any known crash, however I have seen the files go into the main debugfs root directory in cases where the debugfs directory was not created due to other internal bugs with blktrace now fixed. blktrace is also completely useless without this directory, so this ensures to userspace we only setup blktrace if the kernel can stuff files where they are supposed to go into. debugfs directory creations typically aren't checked for, and we have maintainers doing sweep removals of these checks, but since we need this check to ensure proper userspace blktrace functionality we make sure to annotate the justification for the check. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24blktrace: fix debugfs use after freeLuis Chamberlain
On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory") merged on v4.12 Omar fixed the original blktrace code for request-based drivers (multiqueue). This however left in place a possible crash, if you happen to abuse blktrace while racing to remove / add a device. We used to use asynchronous removal of the request_queue, and with that the issue was easier to reproduce. Now that we have reverted to synchronous removal of the request_queue, the issue is still possible to reproduce, its however just a bit more difficult. We essentially run two instances of break-blktrace which add/remove a loop device, and setup a blktrace and just never tear the blktrace down. We do this twice in parallel. This is easily reproduced with the script run_0004.sh from break-blktrace [0]. We can end up with two types of panics each reflecting where we race, one a failed blktrace setup: [ 252.426751] debugfs: Directory 'loop0' with parent 'block' already present! [ 252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0 [ 252.436592] #PF: supervisor write access in kernel mode [ 252.439822] #PF: error_code(0x0002) - not-present page [ 252.442967] PGD 0 P4D 0 [ 252.444656] Oops: 0002 [#1] SMP NOPTI [ 252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164 [ 252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 [ 252.456343] RIP: 0010:down_write+0x15/0x40 [ 252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d [ 252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246 [ 252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000 [ 252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0 [ 252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001 [ 252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000 [ 252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0 [ 252.473346] FS: 00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000 [ 252.475225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0 [ 252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 252.479866] Call Trace: [ 252.480322] simple_recursive_removal+0x4e/0x2e0 [ 252.481078] ? debugfs_remove+0x60/0x60 [ 252.481725] ? relay_destroy_buf+0x77/0xb0 [ 252.482662] debugfs_remove+0x40/0x60 [ 252.483518] blk_remove_buf_file_callback+0x5/0x10 [ 252.484328] relay_close_buf+0x2e/0x60 [ 252.484930] relay_open+0x1ce/0x2c0 [ 252.485520] do_blk_trace_setup+0x14f/0x2b0 [ 252.486187] __blk_trace_setup+0x54/0xb0 [ 252.486803] blk_trace_ioctl+0x90/0x140 [ 252.487423] ? do_sys_openat2+0x1ab/0x2d0 [ 252.488053] blkdev_ioctl+0x4d/0x260 [ 252.488636] block_ioctl+0x39/0x40 [ 252.489139] ksys_ioctl+0x87/0xc0 [ 252.489675] __x64_sys_ioctl+0x16/0x20 [ 252.490380] do_syscall_64+0x52/0x180 [ 252.491032] entry_SYSCALL_64_after_hwframe+0x44/0xa9 And the other on the device removal: [ 128.528940] debugfs: Directory 'loop0' with parent 'block' already present! [ 128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0 [ 128.619537] #PF: supervisor write access in kernel mode [ 128.622700] #PF: error_code(0x0002) - not-present page [ 128.625842] PGD 0 P4D 0 [ 128.627585] Oops: 0002 [#1] SMP NOPTI [ 128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164 [ 128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 [ 128.640471] RIP: 0010:down_write+0x15/0x40 [ 128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d [ 128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246 [ 128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000 [ 128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0 [ 128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0 [ 128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000 [ 128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0 [ 128.660500] FS: 00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000 [ 128.662204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0 [ 128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 128.667282] Call Trace: [ 128.667801] simple_recursive_removal+0x4e/0x2e0 [ 128.668663] ? debugfs_remove+0x60/0x60 [ 128.669368] debugfs_remove+0x40/0x60 [ 128.669985] blk_trace_free+0xd/0x50 [ 128.670593] __blk_trace_remove+0x27/0x40 [ 128.671274] blk_trace_shutdown+0x30/0x40 [ 128.671935] blk_release_queue+0x95/0xf0 [ 128.672589] kobject_put+0xa5/0x1b0 [ 128.673188] disk_release+0xa2/0xc0 [ 128.673786] device_release+0x28/0x80 [ 128.674376] kobject_put+0xa5/0x1b0 [ 128.674915] loop_remove+0x39/0x50 [loop] [ 128.675511] loop_control_ioctl+0x113/0x130 [loop] [ 128.676199] ksys_ioctl+0x87/0xc0 [ 128.676708] __x64_sys_ioctl+0x16/0x20 [ 128.677274] do_syscall_64+0x52/0x180 [ 128.677823] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The common theme here is: debugfs: Directory 'loop0' with parent 'block' already present This crash happens because of how blktrace uses the debugfs directory where it places its files. Upon init we always create the same directory which would be needed by blktrace but we only do this for make_request drivers (multiqueue) block drivers. When you race a removal of these devices with a blktrace setup you end up in a situation where the make_request recursive debugfs removal will sweep away the blktrace files and then later blktrace will also try to remove individual dentries which are already NULL. The inverse is also possible and hence the two types of use after frees. We don't create the block debugfs directory on init for these types of block devices: * request-based block driver block devices * every possible partition * scsi-generic And so, this race should in theory only be possible with make_request drivers. We can fix the UAF by simply re-using the debugfs directory for make_request drivers (multiqueue) and only creating the ephemeral directory for the other type of block devices. The new clarifications on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace *prior* to processing a blktrace ensures the debugfs directories are only created if no possible directory name clashes are possible. This goes tested with: o nvme partitions o ISCSI with tgt, and blktracing against scsi-generic with: o block o tape o cdrom o media changer o blktests This patch is part of the work which disputes the severity of CVE-2019-19770 which shows this issue is not a core debugfs issue, but a misuse of debugfs within blktace. Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory") Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: yu kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24blktrace: annotate required lock on do_blk_trace_setup()Luis Chamberlain
Ensure it is clear which lock is required on do_blk_trace_setup(). Suggested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24PM / EM: update callback structure and add device pointerLukasz Luba
The Energy Model framework is going to support devices other that CPUs. In order to make this happen change the callback function and add pointer to a device as an argument. Update the related users to use new function and new callback from the Energy Model. Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24PM / EM: introduce em_dev_register_perf_domain functionLukasz Luba
Add now function in the Energy Model framework which is going to support new devices. This function will help in transition and make it smoother. For now it still checks if the cpumask is a valid pointer, which will be removed later when the new structures and infrastructure will be ready. Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24PM / EM: change naming convention from 'capacity' to 'performance'Lukasz Luba
The Energy Model uses concept of performance domain and capacity states in order to calculate power used by CPUs. Change naming convention from capacity to performance state would enable wider usage in future, e.g. upcoming support for other devices other than CPUs. Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24gcc-plugins/stackleak: Use asm instrumentation to avoid useless register savingAlexander Popov
The kernel code instrumentation in stackleak gcc plugin works in two stages. At first, stack tracking is added to GIMPLE representation of every function (except some special cases). And later, when stack frame size info is available, stack tracking is removed from the RTL representation of the functions with small stack frame. There is an unwanted side-effect for these functions: some of them do useless work with caller-saved registers. As an example of such case, proc_sys_write without() instrumentation: 55 push %rbp 41 b8 01 00 00 00 mov $0x1,%r8d 48 89 e5 mov %rsp,%rbp e8 11 ff ff ff callq ffffffff81284610 <proc_sys_call_handler> 5d pop %rbp c3 retq 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 00 00 00 proc_sys_write() with instrumentation: 55 push %rbp 48 89 e5 mov %rsp,%rbp 41 56 push %r14 41 55 push %r13 41 54 push %r12 53 push %rbx 49 89 f4 mov %rsi,%r12 48 89 fb mov %rdi,%rbx 49 89 d5 mov %rdx,%r13 49 89 ce mov %rcx,%r14 4c 89 f1 mov %r14,%rcx 4c 89 ea mov %r13,%rdx 4c 89 e6 mov %r12,%rsi 48 89 df mov %rbx,%rdi 41 b8 01 00 00 00 mov $0x1,%r8d e8 f2 fe ff ff callq ffffffff81298e80 <proc_sys_call_handler> 5b pop %rbx 41 5c pop %r12 41 5d pop %r13 41 5e pop %r14 5d pop %rbp c3 retq 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 00 00 Let's improve the instrumentation to avoid this: 1. Make stackleak_track_stack() save all register that it works with. Use no_caller_saved_registers attribute for that function. This attribute is available for x86_64 and i386 starting from gcc-7. 2. Insert calling stackleak_track_stack() in asm: asm volatile("call stackleak_track_stack" :: "r" (current_stack_pointer)) Here we use ASM_CALL_CONSTRAINT trick from arch/x86/include/asm/asm.h. The input constraint is taken into account during gcc shrink-wrapping optimization. It is needed to be sure that stackleak_track_stack() call is inserted after the prologue of the containing function, when the stack frame is prepared. This work is a deep reengineering of the idea described on grsecurity blog https://grsecurity.net/resolving_an_unfortunate_stackleak_interaction Signed-off-by: Alexander Popov <alex.popov@linux.com> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Link: https://lore.kernel.org/r/20200624123330.83226-5-alex.popov@linux.com Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-24gcc-plugins/stackleak: Don't instrument itselfAlexander Popov
There is no need to try instrumenting functions in kernel/stackleak.c. Otherwise that can cause issues if the cleanup pass of stackleak gcc plugin is disabled. Signed-off-by: Alexander Popov <alex.popov@linux.com> Link: https://lore.kernel.org/r/20200624123330.83226-2-alex.popov@linux.com Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-23tracing/boottime: Fix kprobe multiple eventsSascha Ortmann
Fix boottime kprobe events to report and abort after each failure when adding probes. As an example, when we try to set multiprobe kprobe events in bootconfig like this: ftrace.event.kprobes.vfsevents { probes = "vfs_read $arg1 $arg2,, !error! not reported;?", // leads to error "vfs_write $arg1 $arg2" } This will not work as expected. After commit da0f1f4167e3af69e ("tracing/boottime: Fix kprobe event API usage"), the function trace_boot_add_kprobe_event will not produce any error message when adding a probe fails at kprobe_event_gen_cmd_start. Furthermore, we continue to add probes when kprobe_event_gen_cmd_end fails (and kprobe_event_gen_cmd_start did not fail). In this case the function even returns successfully when the last call to kprobe_event_gen_cmd_end is successful. The behaviour of reporting and aborting after failures is not consistent. The function trace_boot_add_kprobe_event now reports each failure and stops adding probes immediately. Link: https://lkml.kernel.org/r/20200618163301.25854-1-sascha.ortmann@stud.uni-hannover.de Cc: stable@vger.kernel.org Cc: linux-kernel@i4.cs.fau.de Co-developed-by: Maximilian Werner <maximilian.werner96@gmail.com> Fixes: da0f1f4167e3 ("tracing/boottime: Fix kprobe event API usage") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Maximilian Werner <maximilian.werner96@gmail.com> Signed-off-by: Sascha Ortmann <sascha.ortmann@stud.uni-hannover.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23tracing: Fix event trigger to accept redundant spacesMasami Hiramatsu
Fix the event trigger to accept redundant spaces in the trigger input. For example, these return -EINVAL echo " traceon" > events/ftrace/print/trigger echo "traceon if common_pid == 0" > events/ftrace/print/trigger echo "disable_event:kmem:kmalloc " > events/ftrace/print/trigger But these are hard to find what is wrong. To fix this issue, use skip_spaces() to remove spaces in front of actual tokens, and set NULL if there is no token. Link: http://lkml.kernel.org/r/159262476352.185015.5261566783045364186.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: 85f2b08268c0 ("tracing: Add basic event trigger framework") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23tracing/boot: Fix config dependency for synthedic eventMasami Hiramatsu
Since commit 726721a51838 ("tracing: Move synthetic events to a separate file") decoupled synthetic event from histogram, boot-time tracing also has to check CONFIG_SYNTH_EVENT instead of CONFIG_HIST_TRIGGERS. Link: http://lkml.kernel.org/r/159262475441.185015.5300725180746017555.stgit@devnote2 Fixes: 726721a51838 ("tracing: Move synthetic events to a separate file") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23bpf: Restore behaviour of CAP_SYS_ADMIN allowing the loading of networking ↵Maciej Żenczykowski
bpf programs This is a fix for a regression in commit 2c78ee898d8f ("bpf: Implement CAP_BPF"). Before the above commit it was possible to load network bpf programs with just the CAP_SYS_ADMIN privilege. The Android bpfloader happens to run in such a configuration (it has SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN). Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF") Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: John Stultz <john.stultz@linaro.org> Link: https://lore.kernel.org/bpf/20200620212616.93894-1-zenczykowski@gmail.com
2020-06-23bpf: Set the number of exception entries properly for subprogramsYonghong Song
Currently, if a bpf program has more than one subprograms, each program will be jitted separately. For programs with bpf-to-bpf calls the prog->aux->num_exentries is not setup properly. For example, with bpf_iter_netlink.c modified to force one function to be not inlined and with CONFIG_BPF_JIT_ALWAYS_ON the following error is seen: $ ./test_progs -n 3/3 ... libbpf: failed to load program 'iter/netlink' libbpf: failed to load object 'bpf_iter_netlink' libbpf: failed to load BPF skeleton 'bpf_iter_netlink': -4007 test_netlink:FAIL:bpf_iter_netlink__open_and_load skeleton open_and_load failed #3/3 netlink:FAIL The dmesg shows the following errors: ex gen bug which is triggered by the following code in arch/x86/net/bpf_jit_comp.c: if (excnt >= bpf_prog->aux->num_exentries) { pr_err("ex gen bug\n"); return -EFAULT; } This patch fixes the issue by computing proper num_exentries for each subprogram before calling JIT. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-23audit: log nftables configuration change eventsRichard Guy Briggs
iptables, ip6tables, arptables and ebtables table registration, replacement and unregistration configuration events are logged for the native (legacy) iptables setsockopt api, but not for the nftables netlink api which is used by the nft-variant of iptables in addition to nftables itself. Add calls to log the configuration actions in the nftables netlink api. This uses the same NETFILTER_CFG record format but overloads the table field. type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.878:162) : table=?:0;?:0 family=unspecified entries=2 op=nft_register_gen pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld ... type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.878:162) : table=firewalld:1;?:0 family=inet entries=0 op=nft_register_table pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld ... type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;filter_FORWARD:85 family=inet entries=8 op=nft_register_chain pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld ... type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;filter_FORWARD:85 family=inet entries=101 op=nft_register_rule pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld ... type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;__set0:87 family=inet entries=87 op=nft_register_setelem pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld ... type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;__set0:87 family=inet entries=0 op=nft_register_set pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld For further information please see issue https://github.com/linux-audit/audit-kernel/issues/124 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-06-23Merge up to bpf_probe_read_kernel_str() fix into bpf-nextAlexei Starovoitov
2020-06-23ring-buffer: Zero out time extend if it is nested and not absoluteSteven Rostedt (VMware)
Currently the ring buffer makes events that happen in interrupts that preempt another event have a delta of zero. (Hopefully we can change this soon). But this is to deal with the races of updating a global counter with lockless and nesting functions updating deltas. With the addition of absolute time stamps, the time extend didn't follow this rule. A time extend can happen if two events happen longer than 2^27 nanoseconds appart, as the delta time field in each event is only 27 bits. If that happens, then a time extend is injected with 2^59 bits of nanoseconds to use (18 years). But if the 2^27 nanoseconds happen between two events, and as it is writing the event, an interrupt triggers, it will see the 2^27 difference as well and inject a time extend of its own. But a recent change made the time extend logic not take into account the nesting, and this can cause two time extend deltas to happen moving the time stamp much further ahead than the current time. This gets all reset when the ring buffer moves to the next page, but that can cause time to appear to go backwards. This was observed in a trace-cmd recording, and since the data is saved in a file, with trace-cmd report --debug, it was possible to see that this indeed did happen! bash-52501 110d... 81778.908247: sched_switch: bash:52501 [120] S ==> swapper/110:0 [120] [12770284:0x2e8:64] <idle>-0 110d... 81778.908757: sched_switch: swapper/110:0 [120] R ==> bash:52501 [120] [509947:0x32c:64] TIME EXTEND: delta:306454770 length:0 bash-52501 110.... 81779.215212: sched_swap_numa: src_pid=52501 src_tgid=52388 src_ngid=52501 src_cpu=110 src_nid=2 dst_pid=52509 dst_tgid=52388 dst_ngid=52501 dst_cpu=49 dst_nid=1 [0:0x378:48] TIME EXTEND: delta:306458165 length:0 bash-52501 110dNh. 81779.521670: sched_wakeup: migration/110:565 [0] success=1 CPU:110 [0:0x3b4:40] and at the next page, caused the time to go backwards: bash-52504 110d... 81779.685411: sched_switch: bash:52504 [120] S ==> swapper/110:0 [120] [8347057:0xfb4:64] CPU:110 [SUBBUFFER START] [81779379165886:0x1320000] <idle>-0 110dN.. 81779.379166: sched_wakeup: bash:52504 [120] success=1 CPU:110 [0:0x10:40] <idle>-0 110d... 81779.379167: sched_switch: swapper/110:0 [120] R ==> bash:52504 [120] [1168:0x3c:64] Link: https://lkml.kernel.org/r/20200622151815.345d1bf5@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: dc4e2801d400b ("ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP") Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23dma-remap: align the size in dma_common_*_remap()Eric Auger
Running a guest with a virtio-iommu protecting virtio devices is broken since commit 515e5b6d90d4 ("dma-mapping: use vmap insted of reimplementing it"). Before the conversion, the size was page aligned in __get_vm_area_node(). Doing so fixes the regression. Fixes: 515e5b6d90d4 ("dma-mapping: use vmap insted of reimplementing it") Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-23dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATORChristoph Hellwig
The dma coherent pool code needs genalloc. Move the select over from DMA_REMAP, which doesn't actually need it. Fixes: dbed452a078d ("dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Rientjes <rientjes@google.com>
2020-06-23dma-direct: add missing set_memory_decrypted() for coherent mappingDavid Rientjes
When a coherent mapping is created in dma_direct_alloc_pages(), it needs to be decrypted if the device requires unencrypted DMA before returning. Fixes: 3acac065508f ("dma-mapping: merge the generic remapping helpers into dma-direct") Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-22bpf: Set map_btf_{name, id} for all map typesAndrey Ignatov
Set map_btf_name and map_btf_id for all map types so that map fields can be accessed by bpf programs. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
2020-06-22bpf: Support access to bpf map fieldsAndrey Ignatov
There are multiple use-cases when it's convenient to have access to bpf map fields, both `struct bpf_map` and map type specific struct-s such as `struct bpf_array`, `struct bpf_htab`, etc. For example while working with sock arrays it can be necessary to calculate the key based on map->max_entries (some_hash % max_entries). Currently this is solved by communicating max_entries via "out-of-band" channel, e.g. via additional map with known key to get info about target map. That works, but is not very convenient and error-prone while working with many maps. In other cases necessary data is dynamic (i.e. unknown at loading time) and it's impossible to get it at all. For example while working with a hash table it can be convenient to know how much capacity is already used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case). At the same time kernel knows this info and can provide it to bpf program. Fill this gap by adding support to access bpf map fields from bpf program for both `struct bpf_map` and map type specific fields. Support is implemented via btf_struct_access() so that a user can define their own `struct bpf_map` or map type specific struct in their program with only necessary fields and preserve_access_index attribute, cast a map to this struct and use a field. For example: struct bpf_map { __u32 max_entries; } __attribute__((preserve_access_index)); struct bpf_array { struct bpf_map map; __u32 elem_size; } __attribute__((preserve_access_index)); struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 4); __type(key, __u32); __type(value, __u32); } m_array SEC(".maps"); SEC("cgroup_skb/egress") int cg_skb(void *ctx) { struct bpf_array *array = (struct bpf_array *)&m_array; struct bpf_map *map = (struct bpf_map *)&m_array; /* .. use map->max_entries or array->map.max_entries .. */ } Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of corresponding struct. Only reading from map fields is supported. For btf_struct_access() to work there should be a way to know btf id of a struct that corresponds to a map type. To get btf id there should be a way to get a stringified name of map-specific struct, such as "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to `struct bpf_map_ops` to handle it: * .map_btf_name keeps a btf name of a struct returned by map_alloc(); * .map_btf_id is used to cache btf id of that struct. To make btf ids calculation cheaper they're calculated once while preparing btf_vmlinux and cached same way as it's done for btf_id field of `struct bpf_func_proto` While calculating btf ids, struct names are NOT checked for collision. Collisions will be checked as a part of the work to prepare btf ids used in verifier in compile time that should land soon. The only known collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs net/core/sock_map.c) was fixed earlier. Both new fields .map_btf_name and .map_btf_id must be set for a map type for the feature to work. If neither is set for a map type, verifier will return ENOTSUPP on a try to access map_ptr of corresponding type. If just one of them set, it's verifier misconfiguration. Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be supported separately. The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by perfmon_capable() so that unpriv programs won't have access to bpf map fields. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com
2020-06-22bpf: Switch btf_parse_vmlinux to btf_find_by_name_kindAndrey Ignatov
btf_parse_vmlinux() implements manual search for struct bpf_ctx_convert since at the time of implementing btf_find_by_name_kind() was not available. Later btf_find_by_name_kind() was introduced in 27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS"). It provides similar search functionality and can be leveraged in btf_parse_vmlinux(). Do it. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/6e12d5c3e8a3d552925913ef73a695dd1bb27800.1592600985.git.rdna@fb.com
2020-06-22fork: fold legacy_clone_args_valid() into _do_fork()Christian Brauner
This separate helper only existed to guarantee the mutual exclusivity of CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD abuses the parent_tid field to return the pidfd. But we can actually handle this uniformely thus removing the helper. For legacy clone we can detect that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID because they will share the same memory which is invalid and for clone3() setting the separate pidfd and parent_tid fields to the same memory is bogus as well. So fold that helper directly into _do_fork() by detecting this case. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: linux-m68k@lists.linux-m68k.org Cc: x86@kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-21Revert "kernel/printk: add kmsg SEEK_CUR handling"Jason A. Donenfeld
This reverts commit 8ece3b3eb576a78d2e67ad4c3a80a39fa6708809. This commit broke userspace. Bash uses ESPIPE to determine whether or not the file should be read using "unbuffered I/O", which means reading 1 byte at a time instead of 128 bytes at a time. I used to use bash to read through kmsg in a really quite nasty way: while read -t 0.1 -r line 2>/dev/null || [[ $? -ne 142 ]]; do echo "SARU $line" done < /dev/kmsg This will show all lines that can fit into the 128 byte buffer, and skip lines that don't. That's pretty awful, but at least it worked. With this change, bash now tries to do 1-byte reads, which means it skips all the lines, which is worse than before. Now, I don't really care very much about this, and I'm already look for a workaround. But I did just spend an hour trying to figure out why my scripts were broken. Either way, it makes no difference to me personally whether this is reverted, but it might be something to consider. If you declare that "trying to read /dev/kmsg with bash is terminally stupid anyway," I might be inclined to agree with you. But do note that bash uses lseek(fd, 0, SEEK_CUR)==>ESPIPE to determine whether or not it's reading from a pipe. Cc: Bruno Meneguele <bmeneg@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>