path: root/kernel
Age  Commit message  Author
2019-11-06  ftrace: add ftrace_init_nop()  (Mark Rutland)
Architectures may need to perform special initialization of ftrace callsites, and today they do so by special-casing ftrace_make_nop() when the expected branch address is MCOUNT_ADDR. In some cases (e.g. for patchable-function-entry), we don't have an mcount-like symbol and don't want a synthetic MCOUNT_ADDR, but we may need to perform some initialization of callsites. To make it possible to separate initialization from runtime modification, and to handle cases without an mcount-like symbol, this patch adds an optional ftrace_init_nop() function that architectures can implement, which does not pass a branch address. Where an architecture does not provide ftrace_init_nop(), we will fall back to the existing behaviour of calling ftrace_make_nop() with MCOUNT_ADDR. At the same time, ftrace_code_disable() is renamed to ftrace_nop_initialize() to make it clearer that it is intended to initialize a callsite into a disabled state, and is not for disabling a callsite that has been runtime enabled. The kerneldoc description of rec arguments is updated to cover non-mcount callsites. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Torsten Duwe <duwe@suse.de> Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Tested-by: Sven Schnelle <svens@stackframe.org> Tested-by: Torsten Duwe <duwe@suse.de> Cc: Ingo Molnar <mingo@redhat.com>
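The fallback described above amounts to the following (a minimal sketch of what lands in include/linux/ftrace.h, shown here for illustration):

#ifndef ftrace_init_nop
static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
	/* No architecture override: keep the existing behaviour and pass
	 * the synthetic MCOUNT_ADDR as the expected branch address. */
	return ftrace_make_nop(mod, rec, MCOUNT_ADDR);
}
#endif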
2019-11-05  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)
Daniel Borkmann says: ==================== pull-request: bpf 2019-11-02 The following pull-request contains BPF updates for your *net* tree. We've added 6 non-merge commits during the last 6 day(s) which contain a total of 8 files changed, 35 insertions(+), 9 deletions(-). The main changes are: 1) Fix ppc BPF JIT's tail call implementation by performing a second pass to gather a stable JIT context before opcode emission, from Eric Dumazet. 2) Fix the build of BPF samples' sys_perf_event_open() usage by compiling out the unavailable test_attr__{enabled,open} checks. Also fix potential overflows in bpf_map_{area_alloc,charge_init} on 32 bit archs, from Björn Töpel. 3) Fix narrow loads of bpf_sysctl context fields with offset > 0 on big endian archs like s390x and also improve the test coverage, from Ilya Leoshkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05  Merge tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux  (Linus Torvalds)
Pull clone3 stack argument update from Christian Brauner: "This changes clone3() to do basic stack validation and to set up the stack depending on whether or not it is growing up or down. With clone3() the expectation is now very simply that the .stack argument points to the lowest address of the stack and that .stack_size specifies the initial stack size. This is different from legacy clone() where the "stack" argument had to point to the lowest or highest address of the stack depending on the architecture. clone3() was released with 5.3. Currently, it is not documented and very unclear to userspace how the stack and stack_size argument have to be passed. After talking to glibc folks we concluded that changing clone3() to determine stack direction and doing basic validation is the right course of action. Note, this is a potentially user visible change. In the very unlikely case that it breaks someone's use-case we will revert. (And then e.g. place the new behavior under an appropriate flag.) Note that passing an empty stack will continue working just as before. Breaking someone's use-case is very unlikely. Neither glibc nor musl currently expose a wrapper for clone3(). There is currently also no real motivation for anyone to use clone3() directly. First, because using clone{3}() with stacks requires some assembly (see glibc and musl). Second, because it does not provide features that legacy clone() doesn't. New features for clone3() will first happen in v5.5 which is why v5.4 is still a good time to try and make that change now and backport it to v5.3. I did a codesearch on https://codesearch.debian.net, github, and gitlab and could not find any software currently relying directly on clone3(). I expect this to change once we land CLONE_CLEAR_SIGHAND which was a request coming from glibc at which point they'll likely start using it" * tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: clone3: validate stack arguments
2019-11-05  clone3: validate stack arguments  (Christian Brauner)
Validate the stack arguments and set up the stack depending on whether or not it is growing down or up. Legacy clone() required userspace to know in which direction the stack is growing and pass down the stack pointer appropriately. To make things more confusing, microblaze uses a variant of the clone() syscall selected by CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument. IA64 has a separate clone2() syscall which also takes an additional stack_size argument. Finally, parisc has a stack that is growing upwards. Userspace therefore has a lot of nasty code like the following: #define __STACK_SIZE (8 * 1024 * 1024) pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd) { pid_t ret; void *stack; stack = malloc(__STACK_SIZE); if (!stack) return -ENOMEM; #ifdef __ia64__ ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd); #elif defined(__parisc__) /* stack grows up */ ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd); #else ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd); #endif return ret; } or even crazier variants such as [3]. With clone3() we have the ability to validate the stack. We can check that when stack_size is passed, the stack pointer is valid and the other way around. We can also check that the memory area userspace gave us is fine to use via access_ok(). Furthermore, we probably should not require userspace to know in which direction the stack is growing. It is easy for us to do this in the kernel and I couldn't find the original reasoning behind exposing this detail to userspace. /* Intentional user visible API change */ clone3() was released with 5.3. Currently, it is not documented and very unclear to userspace how the stack and stack_size argument have to be passed. After talking to glibc folks we concluded that trying to change clone3() to set up the stack instead of requiring userspace to do this is the right course of action. Note that this is an explicit change in user visible behavior we introduce with this patch. If it breaks someone's use-case we will revert! (And then e.g. place the new behavior under an appropriate flag.) Breaking someone's use-case is very unlikely though. First, neither glibc nor musl currently expose a wrapper for clone3(). Second, there is no real motivation for anyone to use clone3() directly since it does not provide features that legacy clone doesn't. New features for clone3() will first happen in v5.5 which is why v5.4 is still a good time to try and make that change now and backport it to v5.3. Searches on [4] did not reveal any packages calling clone3().
[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com [2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein [3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31 [4]: https://codesearch.debian.net Fixes: 7f192e3cd316 ("fork: add clone3") Cc: Kees Cook <keescook@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-api@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: <stable@vger.kernel.org> # 5.3 Cc: GNU C Library <libc-alpha@sourceware.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
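For illustration, a hedged userspace sketch of the new calling convention (clone3() has no libc wrapper, so a raw syscall is assumed; the program itself is hypothetical):

#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args */
#include <sys/syscall.h>
#include <unistd.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE (8 * 1024 * 1024)

int main(void)
{
	void *stack = malloc(STACK_SIZE);
	if (!stack)
		return 1;

	struct clone_args args = {
		.exit_signal = SIGCHLD,
		.stack       = (uintptr_t)stack,	/* lowest address, on every arch */
		.stack_size  = STACK_SIZE,
	};

	/* raw syscall: neither glibc nor musl wrap clone3() yet */
	long pid = syscall(__NR_clone3, &args, sizeof(args));
	return pid < 0;
}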
2019-11-05  irq/irqdomain: Update __irq_domain_alloc_fwnode() function documentation  (Yi Wang)
A recent commit changed a parameter of __irq_domain_alloc_fwnode(), but did not update the documentation comment. Fix it up. Fixes: b977fcf477c1 ("irqdomain/debugfs: Use PAs to generate fwnode names") Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1571476047-29463-1-git-send-email-wang.yi59@zte.com.cn
2019-11-04  timekeeping/vsyscall: Update VDSO data unconditionally  (Huacai Chen)
The update of the VDSO data depends on __arch_use_vsyscall() returning true. This is a leftover from the attempt to map the features of various architectures 1:1 into generic code. The usage of __arch_use_vsyscall() in the actual vsyscall implementations got dropped and replaced by the requirement for the architecture code to return U64_MAX if the global clocksource is not usable in the VDSO. But the __arch_use_vsyscall() check in the update code stayed, which causes the VDSO data to be stale or invalid when an architecture actually implements that function and returns false when the current clocksource is not usable in the VDSO. As a consequence the VDSO implementations of clock_getres(), time(), clock_gettime(CLOCK_.*_COARSE) operate on invalid data and return bogus information. Remove the __arch_use_vsyscall() check from the VDSO update function and update the VDSO data unconditionally. [ tglx: Massaged changelog and removed the now useless implementations in asm-generic/ARM64/MIPS ] Fixes: 44f57d788e7deecb50 ("timekeeping: Provide a generic update_vsyscall() implementation") Signed-off-by: Huacai Chen <chenhc@lemote.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Burton <paul.burton@mips.com> Cc: linux-mips@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1571887709-11447-1-git-send-email-chenhc@lemote.com
2019-11-04  stacktrace: Don't skip first entry on noncurrent tasks  (Jiri Slaby)
When doing cat /proc/<PID>/stack, the output is missing the first entry. When the current code walks the stack starting in stack_trace_save_tsk, it skips all scheduler functions (that's OK) plus one more function. But this one function should be skipped only for the 'current' task as it is stack_trace_save_tsk proper. The original code (before the common infrastructure) skipped one function only for the 'current' task -- see save_stack_trace_tsk before 3599fe12a125. So do so also in the new infrastructure now. Fixes: 214d8ca6ee85 ("stacktrace: Provide common infrastructure") Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Suchanek <msuchanek@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20191030072545.19462-1-jslaby@suse.cz
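The gist of the fix (a simplified sketch of the cookie setup in kernel/stacktrace.c):

struct stacktrace_cookie c = {
	.store	= store,
	.size	= size,
	/* Skip stack_trace_save_tsk() itself only when walking 'current';
	 * a sleeping task's saved stack does not contain that frame. */
	.skip	= skipnr + (current == tsk),
};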
2019-11-04  ftrace: Introduce PERMANENT ftrace_ops flag  (Miroslav Benes)
Livepatch uses ftrace for redirection to new patched functions. It means that if ftrace is disabled, all live patched functions are disabled as well. Toggling the global 'ftrace_enabled' sysctl thus affects it directly. It is not a problem per se, because only an administrator can set sysctl values, but it still may be surprising. Introduce a PERMANENT ftrace_ops flag to amend this. If FTRACE_OPS_FL_PERMANENT is set on any ftrace ops, tracing cannot be disabled by disabling ftrace_enabled. Equally, a callback with the flag set cannot be registered if ftrace_enabled is disabled. Link: http://lkml.kernel.org/r/20191016113316.13415-2-mbenes@suse.cz Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
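A hedged sketch of a callback owner opting in (the flag combination is illustrative; livepatch's real ops set further flags such as FTRACE_OPS_FL_IPMODIFY):

static void notrace my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *ops, struct pt_regs *regs)
{
	/* redirect or trace as usual */
}

static struct ftrace_ops my_ops = {
	.func	= my_callback,
	/* survives 'sysctl kernel.ftrace_enabled=0' */
	.flags	= FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_PERMANENT,
};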
2019-11-04  cpu/speculation: Uninline and export CPU mitigations helpers  (Tyler Hicks)
A kernel module may need to check the value of the "mitigations=" kernel command line parameter as part of its setup when the module needs to perform software mitigations for a CPU flaw. Uninline and export the helper functions surrounding the cpu_mitigations enum to allow for their usage from a module. Lastly, privatize the enum and cpu_mitigations variable since the value of cpu_mitigations can be checked with the exported helper functions. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
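From a module's point of view the exported helpers then look like this (hedged usage sketch; the module itself is made up):

#include <linux/cpu.h>
#include <linux/module.h>

static int __init my_mitigation_init(void)
{
	if (cpu_mitigations_off())
		return 0;	/* admin booted with mitigations=off */

	if (cpu_mitigations_auto_nosmt())
		pr_info("mitigations=auto,nosmt requested\n");

	/* ... apply the software mitigation for the CPU flaw ... */
	return 0;
}
module_init(my_mitigation_init);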
2019-11-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller)
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-11-02 The following pull-request contains BPF updates for your *net-next* tree. We've added 30 non-merge commits during the last 7 day(s) which contain a total of 41 files changed, 1864 insertions(+), 474 deletions(-). The main changes are: 1) Fix long standing user vs kernel access issue by introducing bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel. 2) Accelerated xskmap lookup, from Björn and Maciej. 3) Support for automatic map pinning in libbpf, from Toke. 4) Cleanup of BTF-enabled raw tracepoints, from Alexei. 5) Various fixes to libbpf and selftests. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller)
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02  bpf: Switch BPF probe insns to bpf_probe_read_kernel  (Daniel Borkmann)
Commit 2a02759ef5f8 ("bpf: Add support for BTF pointers to interpreter") explicitly states that the pointer to BTF object is a pointer to a kernel object or NULL. Therefore we should also switch to using the strict kernel probe helper which is restricted to kernel addresses only when architectures have non-overlapping address spaces. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/d2b90827837685424a4b8008dfe0460558abfada.1572649915.git.daniel@iogearbox.net
2019-11-02  bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers  (Daniel Borkmann)
The current bpf_probe_read() and bpf_probe_read_str() helpers are broken in that they assume they can be used for probing memory access for kernel space addresses /as well as/ user space addresses. However, plain use of probe_kernel_read() for both cases will attempt to always access kernel space address space given access is performed under KERNEL_DS and some archs in-fact have overlapping address spaces where a kernel pointer and user pointer would have the /same/ address value and therefore accessing application memory via bpf_probe_read{,_str}() would read garbage values. Let's fix the BPF side by making use of the recently added 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions"). Unfortunately, the only way to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}() helpers are kept as-is to retain their current behavior. The two *_user() variants attempt the access always under USER_DS set, the two *_kernel() variants will -EFAULT when accessing user memory if the underlying architecture has non-overlapping address ranges, also avoiding throwing the kernel warning via 00c42373d397 ("x86-64: add warning for non-canonical user access address dereferences"). Fixes: a5e8c07059d0 ("bpf: add bpf_probe_read_str helper") Fixes: 2541517c32be ("tracing, perf: Implement BPF programs attached to kprobes") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
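An illustrative kprobe program using the new split (a hypothetical example, not from the patch; only the helper names are the ones added here, and header paths follow the samples of that era):

#include <linux/ptrace.h>
#include <linux/sched.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("kprobe/ksys_write")
int trace_write(struct pt_regs *ctx)
{
	const char *ubuf = (const char *)PT_REGS_PARM2(ctx);
	char data[16] = {};
	char comm[16] = {};

	/* user pointer: probed under the user address space */
	bpf_probe_read_user(data, sizeof(data), ubuf);

	/* kernel pointer: -EFAULT if handed a user address on archs
	 * with non-overlapping address spaces */
	struct task_struct *task = (struct task_struct *)bpf_get_current_task();
	bpf_probe_read_kernel(comm, sizeof(comm), task->comm);
	return 0;
}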
2019-11-02  bpf: Make use of probe_user_write in probe write helper  (Daniel Borkmann)
Convert the bpf_probe_write_user() helper to probe_user_write() such that writes are not attempted under KERNEL_DS anymore which is buggy as kernel and user space pointers can have overlapping addresses. Also, given we have the access_ok() check inside probe_user_write(), the helper doesn't need to do it twice. Fixes: 96ae52279594 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/841c461781874c07a0ee404a454c3bc0459eed30.1572649915.git.daniel@iogearbox.net
2019-11-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from David Miller: 1) Fix free/alloc races in batmanadv, from Sven Eckelmann. 2) Several leaks and other fixes in kTLS support of mlx5 driver, from Tariq Toukan. 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke Høiland-Jørgensen. 4) Add an r8152 device ID, from Kazutoshi Noguchi. 5) Missing include in ipv6's addrconf.c, from Ben Dooks. 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can easily infer the 32-bit secret otherwise etc. 7) Several netdevice nesting depth fixes from Taehee Yoo. 8) Fix several KCSAN reported errors, from Eric Dumazet. For example, when doing lockless skb_queue_empty() checks, and accessing sk_napi_id/sk_incoming_cpu lockless as well. 9) Fix jumbo packet handling in RXRPC, from David Howells. 10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet. 11) Fix DMA synchronization in gve driver, from Yangchun Fu. 12) Several bpf offload fixes, from Jakub Kicinski. 13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo. 14) Fix ping latency during high traffic rates in hisilicon driver, from Jiangfeng Xiao. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) net: fix installing orphaned programs net: cls_bpf: fix NULL deref on offload filter removal selftests: bpf: Skip write only files in debugfs selftests: net: reuseport_dualstack: fix uninitalized parameter r8169: fix wrong PHY ID issue with RTL8168dp net: dsa: bcm_sf2: Fix IMP setup for port different than 8 net: phylink: Fix phylink_dbg() macro gve: Fixes DMA synchronization. inet: stop leaking jiffies on the wire ixgbe: Remove duplicate clear_bit() call Documentation: networking: device drivers: Remove stray asterisks e1000: fix memory leaks i40e: Fix receive buffer starvation for AF_XDP igb: Fix constant media auto sense switching when no cable is connected net: ethernet: arc: add the missed clk_disable_unprepare igb: Enable media autosense for the i350. igb/igc: Don't warn on fatal read failures when the device is removed tcp: increase tcp_max_syn_backlog max value net: increase SOMAXCONN to 4096 netdevsim: Fix use-after-free during device dismantle ...
2019-11-02  xsk: Restructure/inline XSKMAP lookup/redirect/flush  (Björn Töpel)
In this commit the XSKMAP entry lookup function used by the XDP redirect code is moved from the xskmap.c file to the xdp_sock.h header, so the lookup can be inlined from, e.g., the bpf_xdp_redirect_map() function. Further, __xsk_map_redirect() and __xsk_map_flush() are moved to xsk.c, which lets the compiler inline the xsk_rcv() and xsk_flush() functions. Finally, all the XDP socket functions were moved from linux/bpf.h to net/xdp_sock.h, where most of the XDP socket functions are anyway. This yields a ~2% performance boost for the xdpsock "rx_drop" scenario. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191101110346.15004-4-bjorn.topel@gmail.com
2019-11-02  bpf: Implement map_gen_lookup() callback for XSKMAP  (Maciej Fijalkowski)
Inline xsk_map_lookup_elem() by implementing the map_gen_lookup() callback. This results in emitting the bpf instructions in place of the bpf_map_lookup_elem() helper call, and better performance of bpf programs. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/20191101110346.15004-3-bjorn.topel@gmail.com
2019-11-02  xsk: Store struct xdp_sock as a flexible array member of the XSKMAP  (Björn Töpel)
Prior to this commit, the XDP socket instances were stored in a separately allocated array of the XSKMAP. Now, we store the sockets as a flexible array member, in a similar fashion to the arraymap. Doing so, we do less pointer chasing in the lookup. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/20191101110346.15004-2-bjorn.topel@gmail.com
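The resulting shape, simplified (hedged sketch; the exact field spelling may differ from the actual header):

struct xsk_map {
	struct bpf_map map;
	/* sockets stored inline at the tail, as in the arraymap;
	 * one less pointer dereference per lookup */
	struct xdp_sock *xsk_map[];
};

/* the allocation then covers the tail in one go, e.g.: */
size = struct_size(m, xsk_map, attr->max_entries);
m = bpf_map_area_alloc(size, bpf_map_attr_numa_node(attr));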
2019-11-01  kexec: Fix pointer-to-int-cast warnings  (Helge Deller)
Fix two pointer-to-int-cast warnings when compiling for the 32-bit parisc platform: kernel/kexec_file.c: In function ‘crash_prepare_elf64_headers’: kernel/kexec_file.c:1307:19: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] phdr->p_vaddr = (Elf64_Addr)_text; ^ kernel/kexec_file.c:1324:19: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] phdr->p_vaddr = (unsigned long long) __va(mstart); ^ Signed-off-by: Helge Deller <deller@gmx.de>
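The likely fix pattern (hedged sketch): cast through unsigned long, which matches pointer width on every supported architecture, and let the assignment widen the result to the 64-bit ELF field:

phdr->p_vaddr = (unsigned long)_text;
/* ... */
phdr->p_vaddr = (unsigned long)__va(mstart);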
2019-11-01  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Ingo Molnar: "Fix two scheduler topology bugs/oversights on Juno r0 2+4 big.LITTLE systems" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/topology: Allow sched_asym_cpucapacity to be disabled sched/topology: Don't try to build empty sched domains
2019-11-01  livepatch: Allow to distinguish different version of system state changes  (Petr Mladek)
The atomic replace runs pre/post (un)install callbacks only from the new livepatch. There are several reasons for this: + Simplicity: clear ordering of operations, no interactions between old and new callbacks. + Reliability: only the new livepatch knows what changes can already be made by older livepatches and how to take over the state. + Testing: the atomic replace can be properly tested only when a newer livepatch is available. It might be too late to fix unwanted effects of callbacks from older livepatches. It might happen that an older change is not enough and the same system state has to be modified another way. Different changes need to be distinguished by a version number added to struct klp_state. The version can also be used to prevent loading incompatible livepatches. The check is done when the livepatch is enabled. The rules are: + Any completely new system state modification is allowed. + System state modifications with the same or higher version are allowed for already modified system states. + Cumulative livepatches must handle all system state modifications from already installed livepatches. + Non-cumulative livepatches are allowed to touch already modified system states. Link: http://lkml.kernel.org/r/20191030154313.13263-4-pmladek@suse.com To: Jiri Kosina <jikos@kernel.org> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Nicolai Stange <nstange@suse.de> Cc: live-patching@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-11-01  livepatch: Basic API to track system state changes  (Petr Mladek)
This is another step toward making it easier to maintain multiple livepatches. One big help was the atomic replace and cumulative livepatches. These livepatches replace the already installed ones. Therefore it should be enough when each cumulative livepatch is consistent. The problems might come with shadow variables and callbacks. They might change the system behavior or state so that it is no longer safe to go back and use an older livepatch or the original kernel code. Also, a new livepatch must be able to detect changes which were made by the already installed livepatches. This is where the livepatch system state tracking gets useful. It makes it possible to: - find whether a system state has already been modified by previous livepatches - store data needed to manipulate and restore the system state The information about the manipulated system states is stored in an array of struct klp_state. It can be searched by two new functions klp_get_state() and klp_get_prev_state(). The dependencies are going to be solved by a version field added later. The only important information is that several non-cumulative livepatches will be allowed to modify the same state. It is similar to allowing the same function to be modified several times. The livepatch author is responsible for preventing incompatible changes. Link: http://lkml.kernel.org/r/20191030154313.13263-3-pmladek@suse.com To: Jiri Kosina <jikos@kernel.org> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Nicolai Stange <nstange@suse.de> Cc: live-patching@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
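A hedged usage sketch from a livepatch callback (the state ID and the takeover logic are hypothetical):

#define MY_STATE_ID	1	/* hypothetical identifier shared by the patches */

static int my_pre_patch(struct klp_object *obj)
{
	struct klp_state *state;

	/* Has an already-installed livepatch modified this state? */
	state = klp_get_prev_state(MY_STATE_ID);
	if (state) {
		/* take over state->data stored by the older livepatch */
	}
	return 0;
}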
2019-11-01  livepatch: Keep replaced patches until post_patch callback is called  (Petr Mladek)
Pre/post (un)patch callbacks might manipulate the system state. Cumulative livepatches might need to take over the changes made by the replaced ones. For this they might need to access some data stored or referenced by the old livepatches. Therefore the replaced livepatches have to stay around until the post_patch() callback is called. This is achieved by calling the free functions later. It is the same location where disabled livepatches have already been freed. Link: http://lkml.kernel.org/r/20191030154313.13263-2-pmladek@suse.com To: Jiri Kosina <jikos@kernel.org> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Nicolai Stange <nstange@suse.de> Cc: live-patching@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-11-01  dma/direct: turn ARCH_ZONE_DMA_BITS into a variable  (Nicolas Saenz Julienne)
Some architectures, notably ARM, are interested in tweaking this depending on their runtime DMA addressing limitations. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-10-31  bpf: Change size to u64 for bpf_map_{area_alloc, charge_init}()  (Björn Töpel)
Prior to this commit, the functions bpf_map_area_alloc() and bpf_map_charge_init() passed the size parameter as size_t. In this commit this is changed to u64. All users of these functions avoid size_t overflows on 32-bit systems, by explicitly using u64 when calculating the allocation size and memory charge cost. However, since the result was narrowed by the size_t when passing size and cost to the functions, the overflow handling was in vain. Instead of changing all call sites to size_t and handling overflow at the call site, the parameter is changed to u64 and checked in the functions above. Fixes: d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc") Fixes: c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Link: https://lore.kernel.org/bpf/20191029154307.23053-1-bjorn.topel@gmail.com
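The guard now lives inside the helpers themselves (simplified sketch of the allocator side):

void *bpf_map_area_alloc(u64 size, int numa_node)
{
	/* On 32-bit, a size that does not fit in size_t must fail here
	 * instead of being silently truncated at the call site. */
	if (size >= SIZE_MAX)
		return NULL;

	/* ... existing kmalloc()/__vmalloc() fallback logic, now
	 * operating on the checked size ... */
}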
2019-10-31  Add wake_up_interruptible_sync_poll_locked()  (David Howells)
Add a wakeup call for a case whereby the caller already has the waitqueue spinlock held. This can be used by pipes to alter the ring buffer indices and issue a wakeup under the same spinlock. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
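Intended usage pattern (hedged sketch of a caller such as the pipe code):

spin_lock_irq(&pipe->wait.lock);
/* ... update the ring buffer indices ... */
wake_up_interruptible_sync_poll_locked(&pipe->wait, EPOLLIN | EPOLLRDNORM);
spin_unlock_irq(&pipe->wait.lock);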
2019-10-31  bpf: Replace prog_raw_tp+btf_id with prog_tracing  (Alexei Starovoitov)
The bpf program type raw_tp together with 'expected_attach_type' was the most appropriate API to indicate BTF-enabled raw_tp programs. But during development it became apparent that 'expected_attach_type' cannot be used and a new 'attach_btf_id' field had to be introduced. Which means that the information is duplicated in two fields where one of them is ignored. Clean it up by introducing a new program type where both 'expected_attach_type' and 'attach_btf_id' fields have specific meaning. In the future 'expected_attach_type' will be extended with other attach points that have similar semantics to raw_tp. This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with prog_type = BPF_PROG_TYPE_TRACING expected_attach_type = BPF_TRACE_RAW_TP attach_btf_id = btf_id of raw tracepoint inside the kernel Future patches will add expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT where programs have the same input context and the same helpers, but different attach points. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
2019-10-31  Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu  (Ingo Molnar)
Pull RCU and LKMM changes from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Dynamic tick (nohz) updates, perhaps most notably changes to force the tick on when needed due to lengthy in-kernel execution on CPUs on which RCU is waiting. - Replace rcu_swap_protected() with rcu_replace_pointer(). - Torture-test updates. - Linux-kernel memory consistency model updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-31  bpf: Fix bpf jit kallsym access  (Alexei Starovoitov)
Jiri reported a crash when JIT is on, but net.core.bpf_jit_kallsyms is off. bpf_prog_kallsyms_find() was skipping addr->bpf_prog resolution logic in oops and stack traces. That's incorrect. It should only skip addr->name resolution for 'cat /proc/kallsyms'. That's what bpf_jit_kallsyms and bpf_jit_harden protect. Fixes: 3dec541b2e63 ("bpf: Add support for BTF pointers to x86 JIT") Reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191030233019.1187404-1-ast@kernel.org
2019-10-30  bpf: Allow narrow loads of bpf_sysctl fields with offset > 0  (Ilya Leoshkevich)
"ctx:file_pos sysctl:read read ok narrow" works on s390 by accident: it reads the wrong byte, which happens to have the expected value of 0. Improve the test by seeking to the 4th byte and expecting 4 instead of 0. This makes the latent problem apparent: the test attempts to read the first byte of bpf_sysctl.file_pos, assuming this is the least-significant byte, which is not the case on big-endian machines: a non-zero offset is needed. The point of the test is to verify narrow loads, so we cannot cheat our way out by simply using BPF_W. The existence of the test means that such loads have to be supported, most likely because llvm can generate them. Fix the test by adding a big-endian variant, which uses an offset to access the least-significant byte of bpf_sysctl.file_pos. This reveals the final problem: verifier rejects accesses to bpf_sysctl fields with offset > 0. Such accesses are already allowed for a wide range of structs: __sk_buff, bpf_sock_addr and sk_msg_md to name a few. Extend this support to bpf_sysctl by using bpf_ctx_range instead of offsetof when matching field offsets. Fixes: 7b146cebe30c ("bpf: Sysctl hook") Fixes: e1550bfe0de4 ("bpf: Add file_pos field to bpf_sysctl ctx") Fixes: 9a1027e52535 ("selftests/bpf: Test file_pos field in bpf_sysctl ctx") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191028122902.9763-1-iii@linux.ibm.com
2019-10-30  kernel: dma-contiguous: mark CMA parameters __initdata/__initconst  (Shyam Saini)
These parameters are only referenced by __init routine calls during early boot so they should be marked as __initdata and __initconst accordingly. Signed-off-by: Shyam Saini <mayhs11saini@gmail.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-30  dma-debug: add a schedule point in debug_dma_dump_mappings()  (Eric Dumazet)
debug_dma_dump_mappings() can take a lot of cpu cycles : lpk43:/# time wc -l /sys/kernel/debug/dma-api/dump 163435 /sys/kernel/debug/dma-api/dump real 0m0.463s user 0m0.003s sys 0m0.459s Let's add a cond_resched() to avoid holding cpu for too long. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Corentin Labbe <clabbe@baylibre.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
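The fix is the standard pattern for long debugfs iterations (sketch close to the actual hunk):

for (idx = 0; idx < HASH_SIZE; idx++) {
	struct hash_bucket *bucket = &dma_entry_hash[idx];
	unsigned long flags;

	cond_resched();		/* new: yield between buckets */
	spin_lock_irqsave(&bucket->lock, flags);
	/* ... print the entries of this bucket ... */
	spin_unlock_irqrestore(&bucket->lock, flags);
}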
2019-10-30  dma-mapping: fix handling of dma-ranges for reserved memory (again)  (Vladimir Murzin)
Daniele reported that the issue previously fixed in c41f9ea998f3 ("drivers: dma-coherent: Account dma_pfn_offset when used with device tree") reappeared shortly after 43fc509c3efb ("dma-coherent: introduce interface for default DMA pool"), where the fix was accidentally dropped. Let's put the fix back in place and respect dma-ranges for reserved memory. Fixes: 43fc509c3efb ("dma-coherent: introduce interface for default DMA pool") Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com> Tested-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com> Tested-by: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-30  Merge branches 'doc.2019.10.29a', 'fixes.2019.10.30a', 'nohz.2019.10.28a', 'replace.2019.10.30a', 'torture.2019.10.05a' and 'lkmm.2019.10.05a' into HEAD  (Paul E. McKenney)
doc.2019.10.29a: RCU documentation updates. fixes.2019.10.30a: RCU miscellaneous fixes. nohz.2019.10.28a: RCU NO_HZ and NO_HZ_FULL updates. replace.2019.10.30a: Replace rcu_swap_protected() with rcu_replace(). torture.2019.10.05a: RCU torture-test updates. lkmm.2019.10.05a: Linux kernel memory model updates.
2019-10-30  bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()  (Paul E. McKenney)
This commit replaces the use of rcu_swap_protected() with the more intuitively appealing rcu_replace_pointer() as a step towards removing rcu_swap_protected(). Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> [ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: <netdev@vger.kernel.org> Cc: <bpf@vger.kernel.org>
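The conversion pattern, modeled on the cgroup-bpf call site (hedged sketch):

/* before: in-place swap, data flow hidden in the macro */
rcu_swap_protected(cgrp->bpf.effective[type], old_array,
		   lockdep_is_held(&cgroup_mutex));

/* after: the old value is returned explicitly */
old_array = rcu_replace_pointer(cgrp->bpf.effective[type], old_array,
				lockdep_is_held(&cgroup_mutex));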
2019-10-30  rcu: Suppress levelspread uninitialized messages  (Paul E. McKenney)
New tools bring new warnings, and with v5.3 comes: kernel/rcu/srcutree.c: warning: 'levelspread[<U aa0>]' may be used uninitialized in this function [-Wuninitialized]: => 121:34 This commit suppresses this warning by initializing the full array to INT_MIN, which will result in failures should any out-of-bounds references appear. Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
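The suppression boils down to poisoning the array up front (sketch of the helper in kernel/rcu/rcu.h):

static void rcu_init_levelspread(int *levelspread, const int *levelcnt)
{
	int i;

	/* Pre-fill with INT_MIN so any out-of-bounds use fails loudly
	 * instead of reading garbage. */
	for (i = 0; i < RCU_NUM_LVLS; i++)
		levelspread[i] = INT_MIN;

	/* ... existing computation overwrites the entries in use ... */
}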
2019-10-30  rcu: Fix uninitialized variable in nocb_gp_wait()  (Dan Carpenter)
We never set this to false. This probably doesn't affect most people's runtime because GCC will automatically initialize it to false at certain common optimization levels. But that behavior is related to a bug in GCC and obviously should not be relied on. Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI  (Joel Fernandes (Google))
The RCU-specific resched_cpu() function sends a resched IPI to the specified CPU, which can be used to force the tick on for a given nohz_full CPU. This is needed when this nohz_full CPU is looping in the kernel while blocking the current grace period. However, for the tick to actually be forced on in all cases, that CPU's rcu_data structure's ->rcu_urgent_qs flag must be set beforehand. This commit therefore causes rcu_implicit_dynticks_qs() to set this flag prior to invoking resched_cpu() on a holdout nohz_full CPU. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30  workqueue: Convert for_each_wq to use built-in list check  (Joel Fernandes (Google))
Because list_for_each_entry_rcu() can now check for holding a lock as well as for being in an RCU read-side critical section, this commit replaces the workqueue_sysfs_unregister() function's use of assert_rcu_or_wq_mutex() and list_for_each_entry_rcu() with list_for_each_entry_rcu() augmented with a lockdep_is_held() optional argument. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
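The resulting iteration (sketch of the pattern the commit describes):

list_for_each_entry_rcu(wq, &workqueues, list,
			lockdep_is_held(&wq_pool_mutex)) {
	/* lockdep now checks "RCU read side OR wq_pool_mutex held"
	 * inside the iterator itself */
}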
2019-10-30  rcu: Several rcu_segcblist functions can be static  (kbuild test robot)
None of rcu_segcblist_set_len(), rcu_segcblist_add_len(), or rcu_segcblist_xchg_len() are used outside of kernel/rcu/rcu_segcblist.c. This commit therefore makes them static. Fixes: eda669a6a2c5 ("rcu/nocb: Atomic ->len field in rcu_segcblist structure") Signed-off-by: kbuild test robot <lkp@intel.com> [ paulmck: "Fixes:" updated per Stephen Rothwell feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30  bpf: Enforce 'return 0' in BTF-enabled raw_tp programs  (Alexei Starovoitov)
The return value of raw_tp programs is ignored by __bpf_trace_run() that calls them. The verifier also allows any value to be returned. For BTF-enabled raw_tp, let's enforce 'return 0', so that the return value can be used for something in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191029032426.1206762-1-ast@kernel.org
2019-10-29  io-wq: small threadpool implementation for io_uring  (Jens Axboe)
This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are: - We can assign memory context smarter and more persistently if we manage the lifetime of threads. - We can drop various work-arounds we have in io_uring, like the async_list. - We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers. - We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets. - We need the above cancel for being able to assign and use file tables from a process. - We can implement a more thorough cancel operation in general. - We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads. This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node. io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29  locking/mutex: Complain upon mutex API misuse in IRQ contexts  (Davidlohr Bueso)
Add warning checks if mutex_trylock() or mutex_unlock() are used in IRQ contexts, under CONFIG_DEBUG_MUTEXES=y. While the mutex rules and semantics are explicitly documented, this allows any abusers to be exposed and robustifies the whole thing. While trylock and unlock are non-blocking, calling from IRQ context is still forbidden (the lock must be taken and released in the same context). Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: https://lkml.kernel.org/r/20191025033634.3330-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
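The checks boil down to something like the following (hedged sketch, not the exact hunks):

bool __sched mutex_trylock(struct mutex *lock)
{
#ifdef CONFIG_DEBUG_MUTEXES
	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
	WARN_ON(in_interrupt());	/* trylock from IRQ context is misuse */
#endif
	/* ... existing fast/slow path ... */
}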
2019-10-29  futex: Drop leftover wake_q_add() comment  (Davidlohr Bueso)
Since the original comment, we have moved to do the task reference counting explicitly along with wake_q_add_safe(). Drop the now incorrect comment. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: https://lkml.kernel.org/r/20191023033450.6445-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29  sched/kcpustat: Introduce vtime-aware kcpustat accessor for CPUTIME_SYSTEM  (Frederic Weisbecker)
Kcpustat is not correctly supported on nohz_full CPUs. The tick doesn't fire and the cputime therefore doesn't move forward. The issue showed up after the vanishing of the remaining 1Hz tick, which made the stall visible. We solve that by checking, through RCU, the task running on the CPU and reading its vtime delta, which we add to the raw kcpustat values. We make sure that we fetch a coherent raw-kcpustat/vtime-delta couple sequence while checking that the CPU referred to by the target vtime is the correct one, under the locked vtime seqcount. Only CPUTIME_SYSTEM is handled here as a start because it's the trivial case. User and guest time will require more preparation work to correctly handle niceness. Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpengli@tencent.com> Link: https://lkml.kernel.org/r/20191025020303.19342-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29  sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu()  (Frederic Weisbecker)
Standardize the naming on top of the vtime_accounting_enabled_*() base. Also make it clear we are checking the vtime state of the *current* CPU with this function. We'll need to add an API to check that state on remote CPUs as well, so we must disambiguate the naming. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191016025700.31277-9-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29  context_tracking: Rename context_tracking_is_enabled() => context_tracking_enabled()  (Frederic Weisbecker)
Remove the superfluous "is" in the middle of the name. We want to standardize the naming so that it can be expanded through suffixes: context_tracking_enabled() context_tracking_enabled_cpu() context_tracking_enabled_this_cpu() Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191016025700.31277-6-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29  sched/cputime: Add vtime guest task state  (Frederic Weisbecker)
Record guest as a VTIME state instead of guessing it from VTIME_SYS and PF_VCPU. This is going to simplify the cputime read side especially as its state machine is going to further expand in order to fully support kcpustat on nohz_full. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191016025700.31277-4-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29  sched/cputime: Add vtime idle task state  (Frederic Weisbecker)
Record idle as a VTIME state instead of guessing it from VTIME_SYS and is_idle_task(). This is going to simplify the cputime read side especially as its state machine is going to further expand in order to fully support kcpustat on nohz_full. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191016025700.31277-3-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29  sched/vtime: Record CPU under seqcount for kcpustat needs  (Frederic Weisbecker)
In order to compute the kcpustat delta on a nohz CPU, we'll need to fetch the task running on that target. Checking that its vtime state snapshot actually refers to the relevant target involves recording that CPU under the seqcount locked on task switch. This is a step toward making kcpustat moving forward on full nohz CPUs. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191016025700.31277-2-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>