summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-09-06printk/tracing: Do not trace printk_nmi_enter()Steven Rostedt (VMware)
I hit the following splat in my tests: ------------[ cut here ]------------ IRQs not enabled as expected WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 EIP: tick_nohz_idle_enter+0x44/0x8c Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00 75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0 e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007 ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296 CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0 Call Trace: do_idle+0x33/0x202 cpu_startup_entry+0x61/0x63 start_secondary+0x18e/0x1ed startup_32_smp+0x164/0x168 irq event stamp: 18773830 hardirqs last enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10 hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10 softirqs last enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b ---[ end trace b7c64aa79e17954a ]--- After a bit of debugging, I found what was happening. This would trigger when performing "perf" with a high NMI interrupt rate, while enabling and disabling function tracer. Ftrace uses breakpoints to convert the nops at the start of functions to calls to the function trampolines. The breakpoint traps disable interrupts and this makes calls into lockdep via the trace_hardirqs_off_thunk in the entry.S code. What happens is the following: do_idle { [interrupts enabled] <interrupt> [interrupts disabled] TRACE_IRQS_OFF [lockdep says irqs off] [...] TRACE_IRQS_IRET test if pt_regs say return to interrupts enabled [yes] TRACE_IRQS_ON [lockdep says irqs are on] <nmi> nmi_enter() { printk_nmi_enter() [traced by ftrace] [ hit ftrace breakpoint ] <breakpoint exception> TRACE_IRQS_OFF [lockdep says irqs off] [...] TRACE_IRQS_IRET [return from breakpoint] test if pt_regs say interrupts enabled [no] [iret back to interrupt] [iret back to code] tick_nohz_idle_enter() { lockdep_assert_irqs_enabled() [lockdep say no!] Although interrupts are indeed enabled, lockdep thinks it is not, and since we now do asserts via lockdep, it gives a false warning. The issue here is that printk_nmi_enter() is called before lockdep_off(), which disables lockdep (for this reason) in NMIs. By simply not allowing ftrace to see printk_nmi_enter() (via notrace annotation) we keep lockdep from getting confused. Cc: stable@vger.kernel.org Fixes: 42a0bb3f71383 ("printk/nmi: generic solution for safe printk in NMI") Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-09-06cpu/hotplug: Prevent state corruption on error rollbackThomas Gleixner
When a teardown callback fails, the CPU hotplug code brings the CPU back to the previous state. The previous state becomes the new target state. The rollback happens in undo_cpu_down() which increments the state unconditionally even if the state is already the same as the target. As a consequence the next CPU hotplug operation will start at the wrong state. This is easily to observe when __cpu_disable() fails. Prevent the unconditional undo by checking the state vs. target before incrementing state and fix up the consequently wrong conditional in the unplug code which handles the failure of the final CPU take down on the control CPU side. Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: josh@joshtriplett.org Cc: peterz@infradead.org Cc: jiangshanlai@gmail.com Cc: dzickus@redhat.com Cc: brendan.jackman@arm.com Cc: malat@debian.org Cc: sramana@codeaurora.org Cc: linux-arm-msm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de ----
2018-09-06cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()Neeraj Upadhyay
The smp_mb() in cpuhp_thread_fun() is misplaced. It needs to be after the load of st->should_run to prevent reordering of the later load/stores w.r.t. the load of st->should_run. Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infraded.org> Cc: josh@joshtriplett.org Cc: peterz@infradead.org Cc: jiangshanlai@gmail.com Cc: dzickus@redhat.com Cc: brendan.jackman@arm.com Cc: malat@debian.org Cc: mojha@codeaurora.org Cc: sramana@codeaurora.org Cc: linux-arm-msm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1536126727-11629-1-git-send-email-neeraju@codeaurora.org
2018-09-05bpf/verifier: fix verifier instabilityAlexei Starovoitov
Edward Cree says: In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access() has supplied a reg_type, the other members of the register state are set appropriately. Previously reg.range was set to 0, but as it is in a union with reg.map_ptr, which is larger, upper bytes of the latter were left in place. This then caused the memcmp() in regsafe() to fail, preventing some branches from being pruned (and occasionally causing the same program to take a varying number of processed insns on repeated verifier runs). Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known() Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Debugged-by: Edward Cree <ecree@solarflare.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-09-04Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "17 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: nilfs2: convert to SPDX license tags drivers/dax/device.c: convert variable to vm_fault_t type lib/Kconfig.debug: fix three typos in help text checkpatch: add __ro_after_init to known $Attribute mm: fix BUG_ON() in vmf_insert_pfn_pud() from VM_MIXEDMAP removal uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name memory_hotplug: fix kernel_panic on offline page processing checkpatch: add optional static const to blank line declarations test ipc/shm: properly return EIDRM in shm_lock() mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported. mm/util.c: improve kvfree() kerneldoc tools/vm/page-types.c: fix "defined but not used" warning tools/vm/slabinfo.c: fix sign-compare warning kmemleak: always register debugfs file mm: respect arch_dup_mmap() return value mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). mm: memcontrol: print proper OOM header when no eligible victim left
2018-09-04mm: respect arch_dup_mmap() return valueNadav Amit
Commit d70f2a14b72a ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") ignored the return value of arch_dup_mmap(). As a result, on x86, a failure to duplicate the LDT (e.g. due to memory allocation error) would leave the duplicated memory mapping in an inconsistent state. Fix by using the return value, as it was before the change. Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com Fixes: d70f2a14b72a4 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Must perform TXQ teardown before unregistering interfaces in mac80211, from Toke Høiland-Jørgensen. 2) Don't allow creating mac80211_hwsim with less than one channel, from Johannes Berg. 3) Division by zero in cfg80211, fix from Johannes Berg. 4) Fix endian issue in tipc, from Haiqing Bai. 5) BPF sockmap use-after-free fixes from Daniel Borkmann. 6) Spectre-v1 in mac80211_hwsim, from Jinbum Park. 7) Missing rhashtable_walk_exit() in tipc, from Cong Wang. 8) Revert kvzalloc() conversion of AF_PACKET, it breaks mmap() when kvzalloc() tries to use kmalloc() pages. From Eric Dumazet. 9) Fix deadlock in hv_netvsc, from Dexuan Cui. 10) Do not restart timewait timer on RST, from Florian Westphal. 11) Fix double lwstate refcount grab in ipv6, from Alexey Kodanev. 12) Unsolicit report count handling is off-by-one, fix from Hangbin Liu. 13) Sleep-in-atomic in cadence driver, from Jia-Ju Bai. 14) Respect ttl-inherit in ip6 tunnel driver, from Hangbin Liu. 15) Use-after-free in act_ife, fix from Cong Wang. 16) Missing hold to meta module in act_ife, from Vlad Buslov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (91 commits) net: phy: sfp: Handle unimplemented hwmon limits and alarms net: sched: action_ife: take reference to meta module act_ife: fix a potential use-after-free net/mlx5: Fix SQ offset in QPs with small RQ tipc: correct spelling errors for tipc_topsrv_queue_evt() comments tipc: correct spelling errors for struct tipc_bc_base's comment bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA. bnxt_en: Clean up unused functions. bnxt_en: Fix firmware signaled resource change logic in open. sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel sctp: fix invalid reference to the index variable of the iterator net/ibm/emac: wrong emac_calc_base call was used by typo net: sched: null actions array pointer before releasing action vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition r8169: add support for NCube 8168 network card ip6_tunnel: respect ttl inherit for ip6tnl mac80211: shorten the IBSS debug messages mac80211: don't Tx a deauth frame if the AP forbade Tx mac80211: Fix station bandwidth setting after channel switch mac80211: fix a race between restart and CSA flows ...
2018-09-04stackleak: Allow runtime disabling of kernel stack erasingAlexander Popov
Introduce CONFIG_STACKLEAK_RUNTIME_DISABLE option, which provides 'stack_erasing' sysctl. It can be used in runtime to control kernel stack erasing for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexander Popov <alex.popov@linux.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04fs/proc: Show STACKLEAK metrics in the /proc file systemAlexander Popov
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about tasks via the /proc file system. In particular, /proc/<pid>/stack_depth shows the maximum kernel stack consumption for the current and previous syscalls. Although this information is not precise, it can be useful for estimating the STACKLEAK performance impact for your workloads. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexander Popov <alex.popov@linux.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04gcc-plugins: Add STACKLEAK plugin for tracking the kernel stackAlexander Popov
The STACKLEAK feature erases the kernel stack before returning from syscalls. That reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. This commit introduces the STACKLEAK gcc plugin. It is needed for tracking the lowest border of the kernel stack, which is important for the code erasing the used part of the kernel stack at the end of syscalls (comes in a separate commit). The STACKLEAK feature is ported from grsecurity/PaX. More information at: https://grsecurity.net/ https://pax.grsecurity.net/ This code is modified from Brad Spengler/PaX Team's code in the last public patch of grsecurity/PaX based on our understanding of the code. Changes or omissions from the original code are ours and don't reflect the original grsecurity/PaX code. Signed-off-by: Alexander Popov <alex.popov@linux.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscallsAlexander Popov
The STACKLEAK feature (initially developed by PaX Team) has the following benefits: 1. Reduces the information that can be revealed through kernel stack leak bugs. The idea of erasing the thread stack at the end of syscalls is similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel crypto, which all comply with FDP_RIP.2 (Full Residual Information Protection) of the Common Criteria standard. 2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712, CVE-2010-2963). That kind of bugs should be killed by improving C compilers in future, which might take a long time. This commit introduces the code filling the used part of the kernel stack with a poison value before returning to userspace. Full STACKLEAK feature also contains the gcc plugin which comes in a separate commit. The STACKLEAK feature is ported from grsecurity/PaX. More information at: https://grsecurity.net/ https://pax.grsecurity.net/ This code is modified from Brad Spengler/PaX Team's code in the last public patch of grsecurity/PaX based on our understanding of the code. Changes or omissions from the original code are ours and don't reflect the original grsecurity/PaX code. Performance impact: Hardware: Intel Core i7-4770, 16 GB RAM Test #1: building the Linux kernel on a single core 0.91% slowdown Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P 4.2% slowdown So the STACKLEAK description in Kconfig includes: "The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary and you are advised to test this feature on your expected workload before deploying it". Signed-off-by: Alexander Popov <alex.popov@linux.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-02Merge tag 'dma-mapping-4.19-2' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull dma-mapping fixes from Christoph Hellwig: "A few fixes for the fallout of being a little more pedantic about dma masks" * tag 'dma-mapping-4.19-2' of git://git.infradead.org/users/hch/dma-mapping: of/platform: initialise AMBA default DMA masks sparc: set a default 32-bit dma mask for OF devices kernel/dma/direct: take DMA offset into account in dma_direct_supported
2018-09-02bpf: avoid misuse of psock when TCP_ULP_BPF collides with another ULPJohn Fastabend
Currently we check sk_user_data is non NULL to determine if the sk exists in a map. However, this is not sufficient to ensure the psock or the ULP ops are not in use by another user, such as kcm or TLS. To avoid this when adding a sock to a map also verify it is of the correct ULP type. Additionally, when releasing a psock verify that it is the TCP_ULP_BPF type before releasing the ULP. The error case where we abort an update due to ULP collision can cause this error path. For example, __sock_map_ctx_update_elem() [...] err = tcp_set_ulp_id(sock, TCP_ULP_BPF) <- collides with TLS if (err) <- so err out here goto out_free [...] out_free: smap_release_sock() <- calling tcp_cleanup_ulp releases the TLS ULP incorrectly. Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-02Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug fix from Thomas Gleixner: "Remove the stale skip_onerr member from the hotplug states" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Remove skip_onerr field from cpuhp_step structure
2018-09-02Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Thomas Gleixner: "A small set of updates for core code: - Prevent tracing in functions which are called from trace patching via stop_machine() to prevent executing half patched function trace entries. - Remove old GCC workarounds - Remove pointless includes of notifier.h" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Remove workaround for unreachable warnings from old GCC notifier: Remove notifier header file wherever not used watchdog: Mark watchdog touch functions as notrace
2018-09-01kernel/dma/direct: take DMA offset into account in dma_direct_supportedChristoph Hellwig
When a device has a DMA offset the dma capable result will change due to the difference between the physical and DMA address. Take that into account. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2018-08-31cpu/hotplug: Remove skip_onerr field from cpuhp_step structureMukesh Ojha
When notifiers were there, `skip_onerr` was used to avoid calling particular step startup/teardown callbacks in the CPU up/down rollback path, which made the hotplug asymmetric. As notifiers are gone now after the full state machine conversion, the `skip_onerr` field is no longer required. Remove it from the structure and its usage. Signed-off-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1535439294-31426-1-git-send-email-mojha@codeaurora.org
2018-08-30Merge branches 'doc.2018.08.30a', 'dynticks.2018.08.30b', 'srcu.2018.08.30b' ↵Paul E. McKenney
and 'torture.2018.08.29a' into HEAD doc.2018.08.30a: Documentation updates dynticks.2018.08.30b: RCU flavor consolidation updates and cleanups srcu.2018.08.30b: SRCU updates torture.2018.08.29a: Torture-test updates
2018-08-30srcu: Make early-boot call_srcu() reuse workqueue listsPaul E. McKenney
Allocating a list_head structure that is almost never used, and, when used, is used only during early boot (rcu_init() and earlier), is a bit wasteful. This commit therefore eliminates that list_head in favor of the one in the work_struct structure. This is safe because the work_struct structure cannot be used until after rcu_init() returns. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-30srcu: Make call_srcu() available during very early bootPaul E. McKenney
Event tracing is moving to SRCU in order to take advantage of the fact that SRCU may be safely used from idle and even offline CPUs. However, event tracing can invoke call_srcu() very early in the boot process, even before workqueue_init_early() is invoked (let alone rcu_init()). Therefore, call_srcu()'s attempts to queue work fail miserably. This commit therefore detects this situation, and refrains from attempting to queue work before rcu_init() time, but does everything else that it would have done, and in addition, adds the srcu_struct to a global list. The rcu_init() function now invokes a new srcu_init() function, which is empty if CONFIG_SRCU=n. Otherwise, srcu_init() queues work for each srcu_struct on the list. This all happens early enough in boot that there is but a single CPU with interrupts disabled, which allows synchronization to be dispensed with. Of course, the queued work won't actually be invoked until after workqueue_init() is invoked, which happens shortly after the scheduler is up and running. This means that although call_srcu() may be invoked any time after per-CPU variables have been set up, there is still a very narrow window when synchronize_srcu() won't work, and this window extends from the time that the scheduler starts until the time that workqueue_init() returns. This can be fixed in a manner similar to the fix for synchronize_rcu_expedited() and friends, but until someone actually needs to use synchronize_srcu() during this window, this fix is added churn for no benefit. Finally, note that Tree SRCU's new srcu_init() function invokes queue_work() rather than the queue_delayed_work() function that is invoked post-boot. The reason is that queue_delayed_work() will (as you would expect) post a timer, and timers have not yet been initialized. So use of queue_work() avoids the complaints about use of uninitialized spinlocks that would otherwise result. Besides, some delay is already provide by the aforementioned fact that the queued work won't actually be invoked until after the scheduler is up and running. Requested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-30rcu: Convert rcu_state.ofl_lock to raw_spinlock_tMike Galbraith
1e64b15a4b10 ("rcu: Fix grace-period hangs due to race with CPU offline") added spinlock_t ofl_lock to the rcu_state structure, then takes it with preemption disabled during CPU offline, which gives the -rt patchset's sleeping spinlock heartburn. This commit therefore converts ->ofl_lock to raw_spinlock_t. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2018-08-30rcu: Remove obsolete ->dynticks_fqs and ->cond_resched_completedPaul E. McKenney
The rcu_data structure's ->dynticks_fqs is incremented but never accesses. Its ->cond_resched_completed field isn't used at all. This commit therefore removes both fields. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Switch ->dynticks to rcu_data structure, remove rcu_dynticksPaul E. McKenney
This commit move ->dynticks from the rcu_dynticks structure to the rcu_data structure, replacing the field of the same name. It also updates the code to access ->dynticks from the rcu_data structure and to use the rcu_data structure rather than following to now-gone ->dynticks field to the now-gone rcu_dynticks structure. While in the area, this commit also fixes up comments. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Switch dyntick nesting counters to rcu_data structurePaul E. McKenney
This commit removes ->dynticks_nesting and ->dynticks_nmi_nesting from the rcu_dynticks structure and updates the code to access them from the rcu_data structure. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Switch urgent quiescent-state requests to rcu_data structurePaul E. McKenney
This commit removes ->rcu_need_heavy_qs and ->rcu_urgent_qs from the rcu_dynticks structure and updates the code to access them from the rcu_data structure. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Switch lazy counts to rcu_data structurePaul E. McKenney
This commit removes ->all_lazy, ->nonlazy_posted and ->nonlazy_posted_snap from the rcu_dynticks structure and updates the code to access them from the rcu_data structure. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Switch last accelerate/advance to rcu_data structurePaul E. McKenney
This commit removes ->last_accelerate and ->last_advance_all from the rcu_dynticks structure and updates the code to access them from the rcu_data structure. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Switch ->tick_nohz_enabled_snap to rcu_data structurePaul E. McKenney
This commit removes ->tick_nohz_enabled_snap from the rcu_dynticks structure and updates the code to access it from the rcu_data structure. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Merge rcu_dynticks structure into rcu_data structurePaul E. McKenney
Now that there is only ever one rcu_data structure per CPU, there is no need for a separate rcu_dynticks structure. This commit therefore adds the rcu_dynticks fields into the rcu_data structure in preparation for removing the rcu_dynticks structure entirely. Note that the ->dynticks field will be handled specially because there is a field by that name in both structures. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Convert "1UL << x" to "BIT(x)"Paul E. McKenney
This commit saves a few characters by converting "1UL << x" to "BIT(x)". Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Avoid resched_cpu() when rescheduling the current CPUPaul E. McKenney
The resched_cpu() interface is quite handy, but it does acquire the specified CPU's runqueue lock, which does not come for free. This commit therefore substitutes the following when directing resched_cpu() at the current CPU: set_tsk_need_resched(current); set_preempt_need_resched(); Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org>
2018-08-30rcu: More aggressively enlist scheduler aid for nohz_full CPUsPaul E. McKenney
Because nohz_full CPUs can leave the scheduler-clock interrupt disabled even when in kernel mode, RCU cannot rely on rcu_check_callbacks() to enlist the scheduler's aid in extracting a quiescent state from such CPUs. This commit therefore more aggressively uses resched_cpu() on nohz_full CPUs that fail to pass through a quiescent state in a timely manner. By default, the resched_cpu() beating starts 300 milliseconds into the quiescent state. While in the neighborhood, add a ->last_fqs_resched field to the rcu_data structure in order to rate-limit resched_cpu() calls from the RCU grace-period kthread. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Compute jiffies_till_sched_qs from other kernel parametersPaul E. McKenney
The jiffies_till_sched_qs value used to determine how old a grace period must be before RCU enlists the help of the scheduler to force a quiescent state on the holdout CPU. Currently, this defaults to HZ/10 regardless of system size and may be set only at boot time. This can be a problem for very large systems, because if the values of the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters are left at their defaults, they are calculated to increase as the number of CPUs actually configured on the system increases. Thus, on a sufficiently large system, RCU would enlist the help of the scheduler before the grace-period kthread had a chance to scan for idle CPUs, which wastes CPU time. This commit therefore allows jiffies_till_sched_qs to be set, if desired, but if left as default, computes is as jiffies_till_first_fqs plus twice jiffies_till_next_fqs, thus allowing three force-quiescent-state scans for idle CPUs. This scales with the number of CPUs, providing sensible default values. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Provide functions for determining if call_rcu() has been invokedPaul E. McKenney
This commit adds rcu_head_init() and rcu_head_after_call_rcu() functions to help RCU users detect when another CPU has passed the specified rcu_head structure and function to call_rcu(). The rcu_head_init() should be invoked before making the structure visible to RCU readers, and then the rcu_head_after_call_rcu() may be invoked from within an RCU read-side critical section on an rcu_head structure that was obtained during a traversal of the data structure in question. The rcu_head_after_call_rcu() function will return true if the rcu_head structure has already been passed (with the specified function) to call_rcu(), otherwise it will return false. If rcu_head_init() has not been invoked on the rcu_head structure or if the rcu_head (AKA callback) has already been invoked, then rcu_head_after_call_rcu() will do WARN_ON_ONCE(). Reported-by: NeilBrown <neilb@suse.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Apply neilb naming feedback. ]
2018-08-30rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structurePaul E. McKenney
The ->rcu_qs_ctr counter was intended to allow providing a lightweight report of a quiescent state to all RCU flavors. But now that there is only one flavor of RCU in any one running kernel, there is no point in having this feature. This commit therefore removes the ->rcu_qs_ctr field from the rcu_dynticks structure and the ->rcu_qs_ctr_snap field from the rcu_data structure. This results in the "rqc" option to the rcu_fqs trace event no longer being used, so this commit also removes the "rqc" description from the header comment. While in the neighborhood, this commit also causes the forward-progress request .rcu_need_heavy_qs be set one jiffies_till_sched_qs interval later in the grace period than the first setting of .rcu_urgent_qs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Motivate Tiny RCU forward progressPaul E. McKenney
If a long-running CPU-bound in-kernel task invokes call_rcu(), the callback won't be invoked until the next context switch. If there are no other runnable tasks (which is not an uncommon situation on deep embedded systems), the callback might never be invoked. This commit therefore causes rcu_check_callbacks() to ask the scheduler for a context switch if there are callbacks posted that are still waiting for a grace period. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcutorture: Dump reader protection sequence if failures or close callsPaul E. McKenney
Now that RCU can have readers with multiple segments, it is quite possible that a specific sequence of reader segments might result in an rcutorture failure (reader spans a full grace period as detected by one of the grace-period primitives) or an rcutorture close call (reader potentially spans a full grace period based on reading out the RCU implementation's grace-period counter, but with no ordering). In such cases, it would clearly ease debugging if the offending specific sequence was known. For the first reader encountering a failure or a close call, this commit therefore dumps out the segments, delay durations, and whether or not the reader was preempted. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Mark variables static, as suggested by kbuild test robot. ]
2018-08-30rcu: Provide improved interrupt-from-idle check in rcu_check_callbacks()Paul E. McKenney
The patch making need_resched() respond to urgent RCU-QS needs used is_idle_task(current) to detect an interrupt from idle, which does work reasonably, but is (in theory at least) vulnerable to loops containing need_resched() invoked from within RCU_NONIDLE() or its tracepoint equivalent. This commit therefore moves rcu_is_cpu_rrupt_from_idle() to a place from which rcu_check_callbacks() can invoke it and replaces the is_idle_task(current) with rcu_is_cpu_rrupt_from_idle(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Make need_resched() respond to urgent RCU-QS needsPaul E. McKenney
The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent need for an RCU quiescent state from the force-quiescent-state processing within the grace-period kthread to context switches and to cond_resched(). Unfortunately, such urgent needs are not communicated to need_resched(), which is sometimes used to decide when to invoke cond_resched(), for but one example, within the KVM vcpu_run() function. As of v4.15, this can result in synchronize_sched() being delayed by up to ten seconds, which can be problematic, to say nothing of annoying. This commit therefore checks rcu_dynticks.rcu_urgent_qs from within rcu_check_callbacks(), which is invoked from the scheduling-clock interrupt handler. If the current task is not an idle task and is not executing in usermode, a context switch is forced, and either way, the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current task is an idle task, then RCU's dyntick-idle code will detect the quiescent state, so no further action is required. Similarly, if the task is executing in usermode, other code in rcu_check_callbacks() and its called functions will report the corresponding quiescent state. Reported-by: Marius Hillenbrand <mhillenb@amazon.de> Reported-by: David Woodhouse <dwmw2@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Inline _rcu_barrier() into its sole remaining callerPaul E. McKenney
Because rcu_barrier() is a one-line wrapper function for _rcu_barrier() and because nothing else calls _rcu_barrier(), this commit inlines _rcu_barrier() into rcu_barrier(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Define rcu_all_qs() only in !PREEMPT buildsPaul E. McKenney
Now that rcu_all_qs() is used only in !PREEMPT builds, move it to tree_plugin.h so that it is defined only in those builds. This in turn means that rcu_momentary_dyntick_idle() is only used in !PREEMPT builds, but it is simply marked __maybe_unused in order to keep it near the rest of the dyntick-idle code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in update.cPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in tree_plugin.hPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in tree_exp.hPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in tree.cPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in tiny.cPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in srcutree.hPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in rcutorture.cPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-08-30rcu: Clean up flavor-related definitions and comments in rcu.hPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>