path: root/kernel
2020-07-04  genirq: Remove preflow handler support  (Valentin Schneider)
That was put in place for sparc64, and blackfin also used it for some time; sparc64 no longer uses those, and blackfin is dead. As there are no more users, remove preflow handlers. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200703155645.29703-3-valentin.schneider@arm.com
2020-07-03  vmalloc: fix the owner argument for the new __vmalloc_node_range callers  (Christoph Hellwig)
Fix the recently added new __vmalloc_node_range callers to pass the correct values as the owner for display in /proc/vmallocinfo. Fixes: 800e26b81311 ("x86/hyperv: allocate the hypercall page with only read and execute bits") Fixes: 10d5e97c1bf8 ("arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page") Fixes: 7a0e27b2a0ce ("mm: remove vmalloc_exec") Reported-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200627075649.2455097-1-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-02  Merge tag 'for-linus-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux  (Linus Torvalds)
Pull data race annotation from Christian Brauner: "This contains an annotation patch for a data race in copy_process() reported by KCSAN when reading and writing nr_threads. The data race is intentional and benign. This is obvious from the comment above the relevant code and based on general consensus when discussing this issue. So simply using data_race() to annotate this as an intentional race seems the best option" * tag 'for-linus-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fork: annotate data race in copy_process()
2020-07-02  bpf: Fix build without CONFIG_STACKTRACE  (Song Liu)
Without CONFIG_STACKTRACE, stack_trace_save_tsk() is not defined. Let get_callchain_entry_for_task() always return NULL in such cases. Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200703024537.79971-1-songliubraving@fb.com
2020-07-02  Merge tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull io_uring fixes from Jens Axboe: "One fix in here, for a regression in 5.7 where a task is waiting in the kernel for a condition, but that condition won't become true until task_work is run. And the task_work can't be run exactly because the task is waiting in the kernel, so we'll never make any progress. One example of that is registering an eventfd and queueing io_uring work, and then the task goes and waits in eventfd read with the expectation that it'll get woken (and read an event) when the io_uring request completes. The io_uring request is finished through task_work, which won't get run while the task is looping in eventfd read" * tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block: io_uring: use signal based task_work running task_work: teach task_work_add() to do signal_wake_up()
2020-07-02  crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo  (Bhupesh Sharma)
Right now user-space tools like 'makedumpfile' and 'crash' need to rely on a best-guess method of determining the value of 'MAX_PHYSMEM_BITS' supported by the underlying kernel. This value is used in user-space code to calculate the bit-space required to store a section for SPARSEMEM (similar to the existing calculation method used in the kernel implementation): #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) Now, regressions have been reported in user-space utilities like 'makedumpfile' and 'crash' on arm64, with the recently added kernel support for 52-bit physical address space, as there is no clear method of determining this value in user-space (other than reading kernel CONFIG flags). As per a suggestion from the makedumpfile maintainer (Kazu), it makes more sense to append 'MAX_PHYSMEM_BITS' to vmcoreinfo in the core code itself rather than in arch-specific code, so that the user-space code for other archs can also benefit from this addition to the vmcoreinfo and use it as a standard way of determining the 'SECTIONS_SHIFT' value in user-land. A reference 'makedumpfile' implementation which reads the 'MAX_PHYSMEM_BITS' value from vmcoreinfo in an arch-independent fashion is available here: While at it, also update the vmcoreinfo documentation for the 'MAX_PHYSMEM_BITS' variable being added to vmcoreinfo. 'MAX_PHYSMEM_BITS' defines the maximum supported physical address space. Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com> Tested-by: John Donnelly <john.p.donnelly@oracle.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: Boris Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Morse <james.morse@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Anderson <anderson@redhat.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: x86@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: kexec@lists.infradead.org Link: https://lore.kernel.org/r/1589395957-24628-2-git-send-email-bhsharma@redhat.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
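For context, a value like this reaches vmcoreinfo through the helper macros in kernel/crash_core.c; a minimal sketch, assuming the existing VMCOREINFO_NUMBER() helper (the exact placement in the patch may differ):

    static int __init crash_save_vmcoreinfo_init(void)
    {
            /* ... existing VMCOREINFO_* exports ... */

            /* Export the bit width of the physical address space so that
             * makedumpfile/crash can derive SECTIONS_SHIFT without guessing. */
            VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);

            return 0;
    }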
2020-07-02  Merge branch 'perf/vlbr'  (Peter Zijlstra)
2020-07-02  cpufreq: Register governors at core_initcall  (Quentin Perret)
Currently, most CPUFreq governors are registered at the core_initcall time when the given governor is the default one, and the module_init time otherwise. In preparation for letting users specify the default governor on the kernel command line, change all of them to be registered at the core_initcall unconditionally, as it is already the case for the schedutil and performance governors. This will allow us to assume that builtin governors have been registered before the built-in CPUFreq drivers probe. And since all governors have similar init/exit patterns now, introduce two new macros, cpufreq_governor_{init,exit}(), to factorize the code. Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
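As a rough illustration of the factoring described above, the new helpers presumably wrap the register/unregister calls and pin registration to core_initcall; a sketch (the authoritative definitions live in include/linux/cpufreq.h and may differ):

    #define cpufreq_governor_init(__governor)                       \
    static int __init __governor##_register(void)                   \
    {                                                                \
            return cpufreq_register_governor(&__governor);          \
    }                                                                \
    core_initcall(__governor##_register)

    #define cpufreq_governor_exit(__governor)                       \
    static void __exit __governor##_unregister(void)                \
    {                                                                \
            cpufreq_unregister_governor(&__governor);               \
    }                                                                \
    module_exit(__governor##_unregister)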
2020-07-01  ring-buffer: Do not trigger a WARN if clock going backwards is detected  (Steven Rostedt (VMware))
After tweaking the ring buffer to be a bit faster, a warning started triggering on one of my machines, causing my tests to fail. This warning is triggered when the delta (current time stamp minus previous time stamp) is larger than the max time held by the ring buffer (59 bits). If the clock were to go backwards slightly, this would then easily trigger this warning. On the machine where it triggered, the clock did go backwards by around 450 nanoseconds, and this happened after a recalibration of the TSC clock. Now that the ring buffer is faster, it detects this: the delta that is used is larger than the max, the warning is triggered and my test fails. To handle the clock going backwards, look at the saved before and after time stamps. If they are the same, it means that the current event did not interrupt another event, and that those time stamps are of a previous event that was recorded. If the max delta is triggered, look at those time stamps, make sure they are the same, then use them to compare with the current timestamp. If the current timestamp is less than the before/after time stamps, then that means the clock being used went backward. Print out a message that this has happened, but do not warn about it (and only print the message once). Still do the warning if the delta is indeed larger than what can be used. Also remove the unneeded KERN_WARNING from the WARN_ONCE() print. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01  ring-buffer: Call trace_clock_local() directly for RETPOLINE kernels  (Steven Rostedt (VMware))
After doing some benchmarks and examining the code, I found that the ring buffer clock calls were quite expensive, and noticed that it uses retpolines. This is because the ring buffer clock is programmable, and can be set. But in most cases it simply uses the fastest ns unit clock which is the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer clock is set to trace_clock_local() and then calling it directly has brought the time of an event on my i7 box from an average of 93 nanoseconds an event down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to 68 nanoseconds! Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
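A sketch of the shortcut being described, assuming a time-stamp helper along these lines in kernel/trace/ring_buffer.c (the real code may be structured differently):

    static inline u64 rb_time_stamp(struct trace_buffer *buffer)
    {
    #ifdef CONFIG_RETPOLINE
            /* In the overwhelmingly common case the clock is still the
             * default trace_clock_local(); calling it directly avoids a
             * retpoline through the function pointer. */
            if (likely(buffer->clock == trace_clock_local))
                    return trace_clock_local();
    #endif
            return buffer->clock();
    }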
2020-07-01  ring-buffer: Move the add_timestamp into its own function  (Steven Rostedt (VMware))
Make a helper function rb_add_timestamp() that moves the adding of the extended time stamps into its own function. Also, remove the noinline and inline for the functions it calls, as recent benchmarks show they do not make a difference (just let gcc decide). Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01  ring-buffer: Consolidate add_timestamp to remove some branches  (Steven Rostedt (VMware))
Reorganize the logic that handles adding the absolute, extended and forced time stamps a little, in such a way as to remove a branch or two. This is just a micro optimization. Also add before and after time stamps to the rb_event_info structure to display those values in the rb_check_timestamps() code, if something were to go wrong. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01  bpf: Allow %pB in bpf_seq_printf() and bpf_trace_printk()  (Song Liu)
This makes it easy to dump a stack trace in text. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-4-songliubraving@fb.com
2020-07-01  bpf: Introduce helper bpf_get_task_stack()  (Song Liu)
Introduce helper bpf_get_task_stack(), which dumps the stack trace of a given task. This is different from bpf_get_stack(), which gets the stack trace of the current task. One potential use case of bpf_get_task_stack() is to call it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file. bpf_get_task_stack() uses stack_trace_save_tsk() instead of get_perf_callchain() for the kernel stack. The benefit of this choice is that stack_trace_save_tsk() doesn't require changes in arch/. The downside of using stack_trace_save_tsk() is that it dumps the stack trace to an unsigned long array. For 32-bit systems, we need to translate it to a u64 array, as sketched below. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-3-songliubraving@fb.com
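A sketch of the 32-bit translation mentioned above (the helper name is illustrative, not the patch's actual function): stack_trace_save_tsk() fills an unsigned long array, which can be widened in place, back to front, into the u64 layout BPF consumers expect:

    static void widen_stack_entries(u64 *entries, u32 nr)
    {
    #if BITS_PER_LONG == 32
            unsigned long *from = (unsigned long *)entries;
            int i;

            /* Walk backwards so a not-yet-converted entry is never
             * overwritten by an already widened one. */
            for (i = nr - 1; i >= 0; i--)
                    entries[i] = from[i];
    #endif
    }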
2020-07-01  perf: Expose get/put_callchain_entry()  (Song Liu)
Sanitize and expose get/put_callchain_entry(). This will be used by the bpf stack map. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-2-songliubraving@fb.com
2020-07-01  bpf: Remove redundant synchronize_rcu.  (Alexei Starovoitov)
bpf_free_used_maps() or close(map_fd) will trigger map_free callback. bpf_free_used_maps() is called after bpf prog is no longer executing: bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps. Hence there is no need to call synchronize_rcu() to protect map elements. Note that hash_of_maps and array_of_maps update/delete inner maps via sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com
2020-07-01  arch: remove unicore32 port  (Mike Rapoport)
The unicore32 port does not seem to have been maintained for a long time now: there is no upstream toolchain that can create unicore32 binaries, and all the links to prebuilt toolchains for unicore32 are dead. Even compilers that were available are not supported by the kernel anymore. Guenter Roeck says: I have stopped building unicore32 images since v4.19 since there is no available compiler that is still supported by the kernel. I am surprised that support for it has not been removed from the kernel. Remove the unicore32 port. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Guenter Roeck <linux@roeck-us.net>
2020-06-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)
Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your *net* tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30  ring-buffer: Mark the !tail (crossing a page) as unlikely  (Steven Rostedt (VMware))
As it is the uncommon case where an event crosses a sub buffer boundary (page), mark that check at the end of reserving an event as unlikely. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30  ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU  (Nicholas Piggin)
On a 144 thread system, `perf ftrace` takes about 20 seconds to start up, due to calling synchronize_rcu() for each CPU. cat /proc/108560/stack 0xc0003e7eb336f470 __switch_to+0x2e0/0x480 __wait_rcu_gp+0x20c/0x220 synchronize_rcu+0x9c/0xc0 ring_buffer_reset_cpu+0x88/0x2e0 tracing_reset_online_cpus+0x84/0xe0 tracing_open+0x1d4/0x1f0 On a system with 10x more threads, it starts to become an annoyance. Batch these up so we disable all the per-cpu buffers first, then synchronize_rcu() once, then reset each of the buffers. This brings the time down to about 0.5s. Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.com Tested-by: Anton Blanchard <anton@ozlabs.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
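A simplified before/after sketch of the batching described above (loop and helper names are illustrative; the real implementation in kernel/trace/ring_buffer.c differs in detail):

    /* Before: one RCU grace period per CPU. */
    for_each_buffer_cpu(buffer, cpu)
            ring_buffer_reset_cpu(buffer, cpu);     /* synchronize_rcu() inside */

    /* After: disable every per-cpu buffer, wait for a single grace
     * period, then do all the resets. */
    for_each_buffer_cpu(buffer, cpu)
            rb_reset_disable(buffer->buffers[cpu]); /* illustrative helper */

    synchronize_rcu();                              /* once, not per CPU */

    for_each_buffer_cpu(buffer, cpu)
            rb_reset_cpu(buffer->buffers[cpu]);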
2020-06-30  ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit  (Steven Rostedt (VMware))
After a discussion about the new time algorithm, which lets nested events still have proper time keeping but requires using local64_t atomic operations, Mathieu was concerned about the performance this would have on 32 bit machines, as in most cases atomic 64 bit operations on them can be expensive. As the ring buffer's timing needs do not require the full features of local64_t, a wrapper is made to implement a new rb_time_t operation that uses two longs on 32 bit machines but still uses the local64_t operations on 64 bit machines. There's a switch that can be made in the file to force 64 bit to use the 32 bit version just for testing purposes. Not all reads need to succeed: a read may fail if it happened while the stamp being read was in the process of being updated. The requirement is that all reads must succeed that were done by an interrupting event (where this event was interrupted by another event that did the write), or if the event itself did the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will always succeed (even if it gets interrupted by another event that writes to t). The result of the read will be either the previous set, or a set performed by an interrupting event. If the read is done by an event that interrupted another event that was in the process of setting the time stamp, and no other event came along to write to that time stamp, it will fail and the rb_time_read() will return that it failed (the value to read will be undefined). A set will always write to the time stamp and return with a valid time stamp, such that any read after it will be valid. A cmpxchg may fail if it interrupted an event that was in the process of updating the time stamp, just like the reads do. Other than that, it will act like a normal cmpxchg. The way this works is that the rb_time_t is made up of three fields: a cnt, which gets updated atomically every time a modification is made; a top, which represents the most significant 30 bits of the time; and a bottom to represent the least significant 30 bits of the time. Notice that the time value is only 60 bits long (where the ring buffer only uses 59 bits, which gives us 18 years of nanoseconds!). The top two bits of both the top and bottom are a 2 bit counter that gets set by the value of the least two significant bits of the cnt. A read of the top and the bottom where both the top and bottom have the same most significant top 2 bits is considered a match and a valid 60 bit number can be created from it. If they do not match, then the number is considered invalid, and this must only happen if an event interrupted another event in the midst of updating the time stamp. This is only used for 32 bit machines as 64 bit machines can get better performance out of the local64_t. This has been tested heavily by forcing 64 bit to use this logic. Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.org Inspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
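A hypothetical sketch of the 32 bit representation described above -- two halves of 30 time bits each, tagged with a 2-bit sequence derived from cnt; the real rb_time_t code in kernel/trace/ring_buffer.c differs in detail:

    #define RB_TIME_SHIFT     30
    #define RB_TIME_VAL_MASK  ((1UL << RB_TIME_SHIFT) - 1)

    struct rb_time_struct {
            local_t cnt;    /* bumped atomically on every update                   */
            local_t top;    /* time bits 30..59 in word bits 0..29, tag in 30..31  */
            local_t bottom; /* time bits  0..29 in word bits 0..29, tag in 30..31  */
    };

    static bool rb_time_read(struct rb_time_struct *t, u64 *ret)
    {
            unsigned long top = local_read(&t->top);
            unsigned long bottom = local_read(&t->bottom);

            /* Only a matching tag proves both halves belong together. */
            if ((top >> RB_TIME_SHIFT) != (bottom >> RB_TIME_SHIFT))
                    return false;

            *ret = ((u64)(top & RB_TIME_VAL_MASK) << RB_TIME_SHIFT) |
                   (bottom & RB_TIME_VAL_MASK);
            return true;
    }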
2020-06-30  bpf: Fix an incorrect branch elimination by verifier  (Yonghong Song)
Wenbo reported an issue in [1] where a check of a null pointer is evaluated as always false. In this particular case, the program type is tp_btf and the pointer to compare is a PTR_TO_BTF_ID. The current verifier considers that PTR_TO_BTF_ID always represents a non-null pointer, hence all PTR_TO_BTF_ID compares to 0 will be evaluated as always not-equal, which resulted in the branch elimination. For example, struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { if (arg == 0) test7_result = 1; return 0; } int BPF_PROG(test8, struct bpf_fentry_test_t *arg) { if (arg->a == 0) test8_result = 1; return 0; } In the above bpf programs, both the arg == 0 and the arg->a == 0 branch are removed. This may not be what the developer expected. The bug was introduced by commit cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ"), where PTR_TO_BTF_ID is considered to be non-null when evaluating a pointer vs. scalar comparison. This may have been added considering we have PTR_TO_BTF_ID_OR_NULL in the verifier as well; PTR_TO_BTF_ID_OR_NULL is added to explicitly require a non-NULL test in selective cases. The current generic pointer tracing framework in the verifier always assigns PTR_TO_BTF_ID so users do not need to check for a NULL pointer at every pointer level like a->b->c->d. We may not want to assign every PTR_TO_BTF_ID as PTR_TO_BTF_ID_OR_NULL, as this will require a null test before pointer dereference, which may cause inconvenience for developers. But we could avoid branch elimination to preserve the original code intention. This patch simply removes PTR_TO_BTF_ID from reg_type_not_null() in the verifier, which prevents the above branches from being eliminated. [1]: https://lore.kernel.org/bpf/79dbb7c0-449d-83eb-5f4f-7af0cc269168@fb.com/T/ Fixes: cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ") Reported-by: Wenbo Zhang <ethercflow@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171240.2523722-1-yhs@fb.com
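The fix boils down to dropping PTR_TO_BTF_ID from the set of register types the verifier treats as provably non-NULL; a sketch of reg_type_not_null() after the change (kernel/bpf/verifier.c; the exact list of remaining types may differ):

    static bool reg_type_not_null(enum bpf_reg_type type)
    {
            /* PTR_TO_BTF_ID is intentionally absent, so "ptr == NULL"
             * checks on it are no longer optimized away. */
            return type == PTR_TO_SOCKET ||
                   type == PTR_TO_TCP_SOCK ||
                   type == PTR_TO_MAP_VALUE ||
                   type == PTR_TO_SOCK_COMMON;
    }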
2020-06-30  ring-buffer: Incorporate absolute timestamp into add_timestamp logic  (Steven Rostedt (VMware))
Instead of calling out the absolute test each time to check if the ring buffer wants absolute time stamps for all its recording, incorporate it into the add_timestamp field and turn that into flags, for faster processing of the difference between wanting an absolute tag and needing to force one. Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
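A hypothetical sketch of the flag encoding this describes (the real enum in kernel/trace/ring_buffer.c may use different names and values):

    enum {
            RB_ADD_STAMP_NONE     = 0,       /* no extra time stamp needed        */
            RB_ADD_STAMP_EXTEND   = BIT(1),  /* delta too large, add an extend    */
            RB_ADD_STAMP_ABSOLUTE = BIT(2),  /* buffer runs in absolute mode      */
            RB_ADD_STAMP_FORCE    = BIT(3),  /* nested event forces a full stamp  */
    };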
2020-06-30  ring-buffer: Have nested events still record running time stamp  (Steven Rostedt (VMware))
Up until now, if an event is interrupted while it is recorded by an interrupt, and that interrupt records events, the time of those events will all be the same. This is because events only record the delta of the time since the previous event (or beginning of a page), and to handle updating the time keeping for that of nested events is extremely racy. After years of thinking about this and several failed attempts, I finally have a solution to solve this puzzle.

The problem is that you need to atomically calculate the delta and then update the time stamp you made the delta from, as well as then record it into the buffer, all this while at any time an interrupt can come in and do the same thing. This is easy to solve with heavy weight atomics, but that would be detrimental to the performance of the ring buffer. The current state of affairs sacrificed the time deltas for nested events for performance.

The reason for previous failed attempts at solving this puzzle was because I was trying to completely avoid slow atomic operations like cmpxchg. I finally came to the conclusion that always avoiding cmpxchg is not possible, which is why those previous attempts always failed. But it is possible to pick one path (the most common case) and avoid cmpxchg in that path, which is the "fast path". The most common case is that an event will not be interrupted and have other events added into it. An event can detect if it has interrupted another event, and for these cases we can make it the slow path and use the heavy operations like cmpxchg.

One more player was added to the game that made this possible, and that is the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59 bit time stamp. (Of course this breaks if a machine is running for more than 18 years without a reboot!).

There are barrier() placements around for being paranoid, even when they are not needed because of other atomic functions nearby. But those should not hurt; if they are not needed, they basically become a nop.

Note, this also makes the race window much smaller, which means there are fewer slow paths to slow down the performance.

The basic idea is that there are two main paths taken.

1) Not being interrupted between taking time stamps and reserving buffer space. In this case, the time stamps taken are true to the location in the buffer.

2) Being interrupted by another path between taking time stamps and reserving buffer space. The objective is to know what the delta is from the last reserved location in the buffer.

As it is possible to detect if an event is interrupting another event before reserving data, space is added to the length to be reserved to inject a full time stamp along with the event being reserved.

When an event is not interrupted, the write stamp is always the time of the last event written to the buffer.

In path 1, there are two sub paths we care about: a) The event did not interrupt another event. b) The event interrupted another event.

In case a, as the write stamp was read and known to be correct, the delta between the current time stamp and the write stamp is the delta between the current event and the previously recorded event.

In case b, extra space was reserved to just put the full time stamp into the buffer, which is done. As stated, in this path the time stamp taken is known to match the location in the buffer.

In path 2, there are also two sub paths we care about: a) The event was not interrupted by another event between reserving space on the buffer and re-reading the write stamp. b) The event was interrupted by another event.

In case a, the write stamp is that of the last event that interrupted this event between taking the time stamps and reserving. As no event came in after re-reading the write stamp, that event is known to be the time of the event directly before this event, and the delta can be taken between the new time stamp and the write stamp.

In case b, one or more events came in between reserving the event and re-reading the write stamp. Since this event's buffer reservation is between other events at this path, there's no way to know what the delta is. But because an event interrupted this event after it started, it's fine to just give a zero delta, and take the same time stamp as the events that happened within the event being recorded.

Here's the implementation of the design of this solution. All this is per cpu, and only needs to worry about nested events (not parallel events).

The players:

 write_tail: The index in the buffer where new events can be written to. It is incremented via local_add() to reserve space for a new event.

 before_stamp: A time stamp set by all events before reserving space.

 write_stamp: A time stamp updated by events after they have successfully reserved space.

        /* Save the current position of write */
 [A]    w = local_read(write_tail);
        barrier();
        /* Read both before and write stamps before touching anything */
        before = local_read(before_stamp);
        after = local_read(write_stamp);
        barrier();

        /*
         * If before and after are the same, then this event is not
         * interrupting a time update. If it is, then reserve space for adding
         * a full time stamp (this can turn into a time extend which is
         * just an extended time delta but fill up the extra space).
         */
        if (after != before)
                abs = true;

        ts = clock();

        /* Now update the before_stamp (everyone does this!) */
 [B]    local_set(before_stamp, ts);

        /* Now reserve space on the buffer */
 [C]    write = local_add_return(len, write_tail);

        /* Set tail to be where this event's data is */
        tail = write - len;

        if (w == tail) {

                /* Nothing interrupted this between A and C */
 [D]            local_set(write_stamp, ts);
                barrier();
 [E]            save_before = local_read(before_stamp);

                if (!abs) {
                        /* This did not interrupt a time update */
                        delta = ts - after;
                } else {
                        delta = ts; /* The full time stamp will be in use */
                }

                if (ts != save_before) {
                        /* slow path - Was interrupted between C and E */
                        /* The update to write_stamp could have overwritten the
                         * update to it by the interrupting event, but before
                         * and after should be the same for all completed top
                         * events */
                        after = local_read(write_stamp);
                        if (save_before > after)
                                local_cmpxchg(write_stamp, after, save_before);
                }
        } else {
                /* slow path - Interrupted between A and C */

                after = local_read(write_stamp);

                temp_ts = clock();
                barrier();
 [F]            if (write == local_read(write_tail) && after < temp_ts) {
                        /* This was not interrupted between C and F.
                         * The last write_stamp is still valid for the previous
                         * event in the buffer. */
                        delta = temp_ts - after;
                        /* OK to keep this new time stamp */
                        ts = temp_ts;
                } else {
                        /* Interrupted between C and F.
                         * Well, there's no use to try to know what the time
                         * stamp is for the previous event. Just set delta to
                         * zero and be the same time as that event that
                         * interrupted us before the reservation of the buffer.
                         */
                        delta = 0;
                }
                /* No need to use full timestamps here */
                abs = 0;
        }

Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30  tracing: Move pipe reference to trace array instead of current_tracer  (Steven Rostedt (VMware))
If a process has the trace_pipe open on a trace_array, the current tracer for that trace array should not be changed. This was originally enforced by a global lock, but when instances were introduced, it was moved to the current_trace. But this structure is shared by all instances, and a trace_pipe is for a single instance. There's no reason that a process that has trace_pipe open on one instance should prevent another instance from changing its current tracer. Move the reference counter to the trace_array instead. This is marked as "Fixes" but is more of a clean up than a true fix. Backport if you want, but it's not critical. Fixes: cf6ab6d9143b1 ("tracing: Add ref count to tracer for when they are being read by pipe") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30  task_work: teach task_work_add() to do signal_wake_up()  (Oleg Nesterov)
So that the target task will exit the wait_event_interruptible-like loop and call task_work_run() asap. The patch turns "bool notify" into a 0/TWA_RESUME/TWA_SIGNAL enum; the new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to avoid the race with recalc_sigpending(), so the patch also adds the new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK. TODO: once this patch is merged we need to change all current users of task_work_add(notify = true) to use TWA_RESUME. Cc: stable@vger.kernel.org # v5.7 Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
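A sketch of the new notify modes and how a caller might use them; the constant values are illustrative, and the authoritative definitions live in include/linux/task_work.h:

    #define TWA_RESUME  1   /* old "notify = true": TIF_NOTIFY_RESUME only     */
    #define TWA_SIGNAL  2   /* additionally signal_wake_up() the target task   */

    /* Callers that need the target to leave a wait_event_interruptible()-style
     * loop promptly pass TWA_SIGNAL (req->task_work is an illustrative
     * callback_head embedded in the caller's request structure): */
    task_work_add(task, &req->task_work, TWA_SIGNAL);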
2020-06-30  bpf, netns: Fix use-after-free in pernet pre_exit callback  (Jakub Sitnicki)
Iterating over BPF links attached to network namespace in pre_exit hook is not safe, even if there is just one. Once link gets auto-detached, that is its back-pointer to net object is set to NULL, the link can be released and freed without waiting on netns_bpf_mutex, effectively causing the list element we are operating on to be freed. This leads to use-after-free when trying to access the next element on the list, as reported by KASAN. Bug can be triggered by destroying a network namespace, while also releasing a link attached to this network namespace. | ================================================================== | BUG: KASAN: use-after-free in netns_bpf_pernet_pre_exit+0xd9/0x130 | Read of size 8 at addr ffff888119e0d778 by task kworker/u8:2/177 | | CPU: 3 PID: 177 Comm: kworker/u8:2 Not tainted 5.8.0-rc1-00197-ga0c04c9d1008-dirty #776 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: netns cleanup_net | Call Trace: | dump_stack+0x9e/0xe0 | print_address_description.constprop.0+0x3a/0x60 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | kasan_report.cold+0x1f/0x40 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | netns_bpf_pernet_pre_exit+0xd9/0x130 | cleanup_net+0x30b/0x5b0 | ? unregister_pernet_device+0x50/0x50 | ? rcu_read_lock_bh_held+0xb0/0xb0 | ? _raw_spin_unlock_irq+0x24/0x50 | process_one_work+0x4d1/0xa10 | ? lock_release+0x3e0/0x3e0 | ? pwq_dec_nr_in_flight+0x110/0x110 | ? rwlock_bug.part.0+0x60/0x60 | worker_thread+0x7a/0x5c0 | ? process_one_work+0xa10/0xa10 | kthread+0x1e3/0x240 | ? kthread_create_on_node+0xd0/0xd0 | ret_from_fork+0x1f/0x30 | | Allocated by task 280: | save_stack+0x1b/0x40 | __kasan_kmalloc.constprop.0+0xc2/0xd0 | netns_bpf_link_create+0xfe/0x650 | __do_sys_bpf+0x153a/0x2a50 | do_syscall_64+0x59/0x300 | entry_SYSCALL_64_after_hwframe+0x44/0xa9 | | Freed by task 198: | save_stack+0x1b/0x40 | __kasan_slab_free+0x12f/0x180 | kfree+0xed/0x350 | process_one_work+0x4d1/0xa10 | worker_thread+0x7a/0x5c0 | kthread+0x1e3/0x240 | ret_from_fork+0x1f/0x30 | | The buggy address belongs to the object at ffff888119e0d700 | which belongs to the cache kmalloc-192 of size 192 | The buggy address is located 120 bytes inside of | 192-byte region [ffff888119e0d700, ffff888119e0d7c0) | The buggy address belongs to the page: | page:ffffea0004678340 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 | flags: 0x2fffe0000000200(slab) | raw: 02fffe0000000200 ffffea00045ba8c0 0000000600000006 ffff88811a80ea80 | raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 | page dumped because: kasan: bad access detected | | Memory state around the buggy address: | ffff888119e0d600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ffff888119e0d680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | >ffff888119e0d700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ^ | ffff888119e0d780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | ffff888119e0d800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ================================================================== Remove the "fast-path" for releasing a link that got auto-detached by a dying network namespace to fix it. This way as long as link is on the list and netns_bpf mutex is held, we have a guarantee that link memory can be accessed. An alternative way to fix this issue would be to safely iterate over the list of links and ensure there is no access to link object after detaching it. 
But, at the moment, optimizing synchronization overhead on link release without a workload in mind seems like an overkill. Fixes: ab53cad90eb1 ("bpf, netns: Keep a list of attached bpf_link's") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200630164541.1329993-1-jakub@cloudflare.com
2020-06-30  bpf: sockmap: Require attach_bpf_fd when detaching a program  (Lorenz Bauer)
The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
2020-06-30  bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACH  (Lorenz Bauer)
Using BPF_PROG_DETACH on a flow dissector program supports neither attach_flags nor attach_bpf_fd. Yet no value is enforced for them. Enforce that attach_flags are zero, and require the current program to be passed via attach_bpf_fd. This allows us to remove the check for CAP_SYS_ADMIN, since userspace can now no longer remove arbitrary flow dissector programs. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-3-lmb@cloudflare.com
2020-06-30  bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACH  (Lorenz Bauer)
Using BPF_PROG_ATTACH on a flow dissector program supports none of target_fd, attach_flags or replace_bpf_fd, yet it accepts any value for them. Enforce that all of them are zero. This is fine for replace_bpf_fd since its presence is indicated by BPF_F_REPLACE. It's more problematic for target_fd, since zero is a valid fd. Should we want to use this field later on, we'd have to add an exception for fd 0. The alternative is to force a value like -1, which requires more changes to tests. There is also precedent for using 0, since bpf_iter uses this for target_fd as well. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-2-lmb@cloudflare.com
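A sketch of the stricter validation described above, as it might look in the netns attach path (field names follow union bpf_attr; the function itself is illustrative):

    static int netns_bpf_attach_check(const union bpf_attr *attr)
    {
            /* None of these have a meaning for flow dissector attachments,
             * so reject anything non-zero. */
            if (attr->target_fd || attr->attach_flags || attr->replace_bpf_fd)
                    return -EINVAL;
            return 0;
    }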
2020-06-30  bpf, netns: Keep a list of attached bpf_link's  (Jakub Sitnicki)
To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
2020-06-30  bpf, netns: Keep attached programs in bpf_prog_array  (Jakub Sitnicki)
Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30  flow_dissector: Pull BPF program assignment up to bpf-netns  (Jakub Sitnicki)
Prepare for using bpf_prog_array to store attached programs by moving the code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-30  bpf: Enforce BPF ringbuf size to be the power of 2  (Andrii Nakryiko)
BPF ringbuf assumes the size to be a multiple of the page size and a power-of-2 value. The latter is important to avoid division while calculating the position inside the ring buffer, using an (N-1) mask instead. This patch fixes the omission to enforce the power-of-2 size rule. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com
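A sketch of the added check, assuming the existing is_power_of_2() and PAGE_ALIGNED() helpers (the actual code in kernel/bpf/ringbuf.c may arrange the test differently):

    /* max_entries is the ring size in bytes for BPF_MAP_TYPE_RINGBUF. */
    if (!is_power_of_2(attr->max_entries) ||
        !PAGE_ALIGNED(attr->max_entries))
            return ERR_PTR(-EINVAL);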
2020-06-30  dma-mapping: Add a new dma_need_sync API  (Christoph Hellwig)
Add a new API to check if calls to dma_sync_single_for_{device,cpu} are required for a given DMA streaming mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-2-hch@lst.de
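Illustrative use of the new API; the declaration below matches the one added to include/linux/dma-mapping.h as far as known:

    bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);

    /* A driver (e.g. an AF_XDP zero-copy path) can skip per-buffer sync
     * calls entirely when the mapping does not need them: */
    if (dma_need_sync(dev, dma_addr))
            dma_sync_single_for_cpu(dev, dma_addr, size, DMA_FROM_DEVICE);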
2020-06-29  audit: add gfp parameter to audit_log_nfcfg  (Richard Guy Briggs)
Fix an inconsistent use of GFP flags in nft_obj_notify() that used GFP_KERNEL when a GFP flag was passed into that function. Given this allocated memory was then used in audit_log_nfcfg(), it led to an audit of all other GFP allocations in net/netfilter/nf_tables_api.c and a modification of audit_log_nfcfg() to accept a GFP parameter. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-06-29  kcsan: Disable branch tracing in core runtime  (Marco Elver)
Disable branch tracing in the core KCSAN runtime if branches are being traced (TRACE_BRANCH_PROFILING). This is to avoid its performance impact, but also to avoid recursion in case KCSAN is enabled for the branch tracing runtime. The latter had already been a problem for KASAN: https://lore.kernel.org/lkml/CANpmjNOeXmD5E3O50Z3MjkiuCYaYOPyi+1rq=GZvEKwBvLR0Ug@mail.gmail.com/ Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  kcsan: Simplify compiler flags  (Marco Elver)
Simplify the set of compiler flags for the runtime by removing cc-option from -fno-stack-protector, because all supported compilers support it. This saves us one compiler invocation during build. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  kcsan: Add jiffies test to test suite  (Marco Elver)
Add a test that neither KCSAN nor the compiler gets confused about accesses to jiffies on different architectures. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  kcsan: Remove existing special atomic rules  (Marco Elver)
Remove existing special atomic rules from kcsan_is_atomic_special() because they are no longer needed. Since we rely on the compiler emitting instrumentation distinguishing volatile accesses, the rules have become redundant. Let's keep kcsan_is_atomic_special() around, so that we have an obvious place to add special rules should the need arise in future. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  kcsan: Rename test.c to selftest.c  (Marco Elver)
Rename 'test.c' to 'selftest.c' to better reflect its purpose (Kconfig variable and code inside already match this). This is to avoid confusion with the test suite module in 'kcsan-test.c'. No functional change. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  kcsan: Silence -Wmissing-prototypes warning with W=1  (Marco Elver)
The functions here should not be forward declared for explicit use elsewhere in the kernel, as they should only be emitted by the compiler due to sanitizer instrumentation. Add forward declarations a line above their definition to shut up warnings in W=1 builds. Link: https://lkml.kernel.org/r/202006060103.jSCpnV1g%lkp@intel.com Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  kcsan: Prefer '__no_kcsan inline' in test  (Marco Elver)
Instead of __no_kcsan_or_inline, prefer '__no_kcsan inline' in test -- this is in case we decide to remove __no_kcsan_or_inline. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  locking/osq_lock: Annotate a data race in osq_lock  (Qian Cai)
The prev->next pointer can be accessed concurrently as noticed by KCSAN: write (marked) to 0xffff9d3370dbbe40 of 8 bytes by task 3294 on cpu 107: osq_lock+0x25f/0x350 osq_wait_next at kernel/locking/osq_lock.c:79 (inlined by) osq_lock at kernel/locking/osq_lock.c:185 rwsem_optimistic_spin <snip> read to 0xffff9d3370dbbe40 of 8 bytes by task 3398 on cpu 100: osq_lock+0x196/0x350 osq_lock at kernel/locking/osq_lock.c:157 rwsem_optimistic_spin <snip> The write only stores NULL to prev->next and the read tests if prev->next equals this_cpu_ptr(&osq_node), so even if the value is shattered, the code still works correctly. Thus, mark it as an intentional data race using the data_race() macro. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
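A sketch of the annotation pattern described above (the exact expression in kernel/locking/osq_lock.c may differ):

    /* The plain read is benign: even a torn value only affects whether we
     * attempt the cmpxchg, which re-checks the pointer atomically. */
    if (data_race(prev->next == node) &&
        cmpxchg(&prev->next, node, NULL) == node)
            break;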
2020-06-29  kcsan: Add test suite  (Marco Elver)
This adds a KCSAN test suite focusing on the behaviour of the integrated runtime. It tests various race scenarios and verifies the reports generated to the console. It makes use of KUnit for test organization, and the Torture framework for test thread control. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  fork: Annotate a data race in vm_area_dup()  (Qian Cai)
struct vm_area_struct could be accessed concurrently as noticed by KCSAN, write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35: vma_interval_tree_insert+0x101/0x150: rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58 (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23 __vma_link_file+0x6e/0xe0 __vma_link_file at mm/mmap.c:629 vma_link+0xa2/0x120 mmap_region+0x753/0xb90 do_mmap+0x45c/0x710 vm_mmap_pgoff+0xc0/0x130 ksys_mmap_pgoff+0x1d1/0x300 __x64_sys_mmap+0x33/0x40 do_syscall_64+0x91/0xc44 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122: vm_area_dup+0x6a/0xe0 vm_area_dup at kernel/fork.c:362 __split_vma+0x72/0x2a0 __split_vma at mm/mmap.c:2661 split_vma+0x5a/0x80 mprotect_fixup+0x368/0x3f0 do_mprotect_pkey+0x263/0x420 __x64_sys_mprotect+0x51/0x70 do_syscall_64+0x91/0xc44 entry_SYSCALL_64_after_hwframe+0x49/0xbe vm_area_dup() blindly copies all fields of the original VMA to the new one. This includes copying vm_area_struct::shared.rb, which is normally protected by i_mmap_lock. But this is fine because the read value will be overwritten on the following __vma_link_file() under proper protection. Thus, mark it as an intentional data race and insert a few assertions for the fields that should not be modified concurrently. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  Merge branches 'doc.2020.06.29a', 'fixes.2020.06.29a', 'kfree_rcu.2020.06.29a', 'rcu-tasks.2020.06.29a', 'scale.2020.06.29a', 'srcu.2020.06.29a' and 'torture.2020.06.29a' into HEAD  (Paul E. McKenney)
doc.2020.06.29a: Documentation updates. fixes.2020.06.29a: Miscellaneous fixes. kfree_rcu.2020.06.29a: kfree_rcu() updates. rcu-tasks.2020.06.29a: RCU Tasks updates. scale.2020.06.29a: Read-side scalability tests. srcu.2020.06.29a: SRCU updates. torture.2020.06.29a: Torture-test updates.
2020-06-29  torture: Dump ftrace at shutdown only if requested  (Paul E. McKenney)
If there is a large number of torture tests running concurrently, all of which are dumping large ftrace buffers at shutdown time, the resulting dumping can take a very long time, particularly on systems with rotating-rust storage. This commit therefore adds a default-off torture.ftrace_dump_at_shutdown module parameter that enables shutdown-time ftrace-buffer dumping. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  rcutorture: Check for unwatched readers  (Paul E. McKenney)
RCU is supposed to be watching all non-idle kernel code and also all softirq handlers. This commit adds some teeth to this statement by adding a WARN_ON_ONCE(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29  rcu/rcutorture: Replace 0 with false  (Jules Irenge)
Coccinelle reports a warning: WARNING: Assignment of 0/1 to bool variable The root cause is that the variable lastphase is a bool, but is initialised with the integer 0. This commit therefore replaces the 0 with false. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>