2021-12-07rcu: Prevent expedited GP from enabling tick on offline CPUPaul E. McKenney
If an RCU expedited grace period starts just when a CPU is in the process of going offline, so that the outgoing CPU has completed its pass through stop-machine but has not yet completed its final dive into the idle loop, RCU will attempt to enable that CPU's scheduling-clock tick via a call to tick_dep_set_cpu(). For this to happen, that CPU has to have been online when the expedited grace period completed its CPU-selection phase. This is pointless: The outgoing CPU has interrupts disabled, so it cannot take a scheduling-clock tick anyway. In addition, the tick_dep_set_cpu() function's eventual call to irq_work_queue_on() will splat as follows: smpboot: CPU 1 is now offline WARNING: CPU: 6 PID: 124 at kernel/irq_work.c:95 +irq_work_queue_on+0x57/0x60 Modules linked in: CPU: 6 PID: 124 Comm: kworker/6:2 Not tainted 5.15.0-rc1+ #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS +rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 Workqueue: rcu_gp wait_rcu_exp_gp RIP: 0010:irq_work_queue_on+0x57/0x60 Code: 8b 05 1d c7 ea 62 a9 00 00 f0 00 75 21 4c 89 ce 44 89 c7 e8 +9b 37 fa ff ba 01 00 00 00 89 d0 c3 4c 89 cf e8 3b ff ff ff eb ee <0f> 0b eb b7 +0f 0b eb db 90 48 c7 c0 98 2a 02 00 65 48 03 05 91 6f RSP: 0000:ffffb12cc038fe48 EFLAGS: 00010282 RAX: 0000000000000001 RBX: 0000000000005208 RCX: 0000000000000020 RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9ad01f45a680 RBP: 000000000004c990 R08: 0000000000000001 R09: ffff9ad01f45a680 R10: ffffb12cc0317db0 R11: 0000000000000001 R12: 00000000fffecee8 R13: 0000000000000001 R14: 0000000000026980 R15: ffffffff9e53ae00 FS: 0000000000000000(0000) GS:ffff9ad01f580000(0000) +knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000000de0c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: tick_nohz_dep_set_cpu+0x59/0x70 rcu_exp_wait_wake+0x54e/0x870 ? sync_rcu_exp_select_cpus+0x1fc/0x390 process_one_work+0x1ef/0x3c0 ? process_one_work+0x3c0/0x3c0 worker_thread+0x28/0x3c0 ? process_one_work+0x3c0/0x3c0 kthread+0x115/0x140 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 ---[ end trace c5bf75eb6aa80bc6 ]--- This commit therefore avoids invoking tick_dep_set_cpu() on offlined CPUs to limit both futility and false-positive splats. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
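The fix is conceptually a simple guard. Below is a minimal sketch of the shape of such a check; the helper name and the tick-dependency bit are illustrative assumptions, not the literal upstream diff.
```
/* Illustrative sketch only: skip the tick-dependency request when the
 * target CPU is already offline.  The helper name and the
 * TICK_DEP_BIT_RCU_EXP bit are assumed for illustration.
 */
#include <linux/cpumask.h>
#include <linux/tick.h>

static void rcu_exp_need_tick_sketch(int cpu)
{
	if (!cpu_online(cpu))
		return;		/* Outgoing CPU: it cannot take a tick anyway. */
	tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
}
```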
2021-12-07rcu: Mark sync_sched_exp_online_cleanup() ->cpu_no_qs.b.exp loadPaul E. McKenney
The sync_sched_exp_online_cleanup() is called from rcutree_online_cpu(), which can be invoked with interrupts enabled. This means that the ->cpu_no_qs.b.exp field is subject to data races from the rcu_exp_handler() IPI handler, so this commit marks the load from that field. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
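In RCU parlance, "marking" a load usually means wrapping it in READ_ONCE(). A minimal sketch of the idea follows; the surrounding function body and the per-CPU lookup are illustrative assumptions, not the exact hunk.
```
#include <linux/compiler.h>	/* READ_ONCE() */

/* Illustrative only: a plain load that races with the rcu_exp_handler()
 * IPI is annotated so the compiler (and KCSAN) treat it as a marked access.
 */
static void sync_sched_exp_online_cleanup_sketch(int cpu)
{
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);	/* assumed lookup */

	if (!READ_ONCE(rdp->cpu_no_qs.b.exp))
		return;			/* Nothing deferred; nothing to clean up. */
	/* ... remainder of the cleanup elided ... */
}
```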
2021-12-07rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu no_qs.b.expFrederic Weisbecker
Having two fields for the same purpose with subtle differences on different RCU flavours is confusing, especially when both fields always exist on both RCU flavours. Fortunately, it is now safe for preemptible RCU to rely on the rcu_data structure's ->cpu_no_qs.b.exp field, just like non-preemptible RCU. This commit therefore removes the ad-hoc ->exp_deferred_qs field. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07rcu: Move rcu_data.cpu_no_qs.b.exp reset to rcu_export_exp_rdp()Frederic Weisbecker
On non-preemptible RCU, move clearing of the rcu_data structure's ->cpu_no_qs.b.exp field to the actual expedited quiescent state report function, matching how preemptible RCU handles the ->exp_deferred_qs field. This prepares for removing ->exp_deferred_qs in favor of ->cpu_no_qs.b.exp for both preemptible and non-preemptible RCU. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-07rcu: Ignore rdp.cpu_no_qs.b.exp on preemptible RCU's rcu_qs()Frederic Weisbecker
Preemptible RCU does not use the rcu_data structure's ->cpu_no_qs.b.exp, instead using a separate ->exp_deferred_qs field to record the need for an expedited quiescent state. In fact ->cpu_no_qs.b.exp should never be set in preemptible RCU because preemptible RCU's expedited grace periods use other mechanisms to record quiescent states. This commit therefore removes the implicit rcu_qs() reference to ->cpu_no_qs.b.exp in favor of a direct reference to ->cpu_no_qs.b.norm. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Properly redirect kvm-remote.sh "echo" commandsPaul E. McKenney
The echo commands following initialization of the "oldrun" variable need to be "tee"d to $oldrun/remote-log. This commit fixes several stragglers. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Fix incorrectly redirected "exit" in kvm-remote.shPaul E. McKenney
The "exit 4" in kvm-remote.sh is pointlessly redirected, so this commit removes the redirection. Fixes: 0092eae4cb4e ("torture: Add kvm-remote.sh script for distributed rcutorture test runs") Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcutorture: Test RCU Tasks lock-contention detectionPaul E. McKenney
This commit adjusts the TRACE02 scenario to use a pair of callback-flood kthreads. This in turn forces lock contention on the single RCU Tasks Trace callback queue, which forces use of all CPUs' queues, thus testing this transition. (No, there is not yet any way to transition back.) Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcutorture: Cause TREE02 and TREE10 scenarios to do more callback floodingPaul E. McKenney
This commit enables two callback-flood kthreads for the TREE02 scenario and 28 for the TREE10 scenario. Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Retry download once before giving upPaul E. McKenney
Currently, a transient network error can kill a run if it happens while downloading the tarball to one of the target systems. This commit therefore does a 60-second wait and then a retry. Should further experience indicate a need, a more elaborate mechanism might be used later. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Make kvm-find-errors.sh report link-time undefined symbolsPaul E. McKenney
This commit makes kvm-find-errors.sh check for and report undefined symbols that are detected at link time. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Catch kvm.sh help text up with actual optionsPaul E. McKenney
This commit brings the kvm.sh script's help text up to date with recently (and some not-so-recently) added parameters. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30refscale: Prevent buffer to pr_alert() being too longLi Zhijian
0Day/LKP observed that the refscale results fail to complete when larger values of nruns (such as 300) are specified. The problem is that printk() can accept at most a 1024-byte buffer. This commit therefore prints the buffer whenever its length exceeds 800 bytes. CC: Philip Li <philip.li@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
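A sketch of the flush-early pattern follows; the helper names, buffer size, and exact threshold are assumptions for illustration rather than code taken from the patch.
```
#include <linux/kernel.h>	/* scnprintf() */
#include <linux/printk.h>	/* pr_alert() */

/* Hypothetical helper: accumulate results, but emit and reset the buffer
 * before it can outgrow what a single printk()/pr_alert() call handles.
 */
#define RESULT_FLUSH_THRESHOLD	800	/* stay well under the ~1024-byte limit */

static char result_buf[1024];
static size_t result_len;

static void result_append(const char *chunk)
{
	result_len += scnprintf(result_buf + result_len,
				sizeof(result_buf) - result_len, "%s", chunk);
	if (result_len > RESULT_FLUSH_THRESHOLD) {
		pr_alert("%s", result_buf);
		result_len = 0;
		result_buf[0] = '\0';
	}
}
```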
2021-11-30refscale: Simplify the errexit checkpointLi Zhijian
There is only the one OOM error case in main_func(), so this commit eliminates the errexit local variable in favor of a branch to cleanup code. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcutorture: Suppress pi-lock-across read-unlock testing for Tiny SRCUPaul E. McKenney
Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep complains when a pi lock is held across that srcu_read_unlock(). Although this is a lockdep false positive (there is no other CPU to complete the deadlock cycle), lockdep is what it is at the moment. This commit therefore prevents rcutorture from holding a pi lock across a Tiny srcu_read_unlock(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcutorture: More thoroughly test nested readersPaul E. McKenney
Currently, nested readers occur only when a timer handler interrupts a reader. This is rare, and is thus insufficient testing of the transition between nesting levels. This commit therefore causes rcutorture nested readers to be the rule rather than the exception. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcutorture: Sanitize RCUTORTURE_RDR_MASKPaul E. McKenney
RCUTORTURE_RDR_MASK is currently not the bit indicated by RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than that one. This is an accident waiting to happen, so this commit makes RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
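The distinction is easiest to see in macro form. The values below are assumptions for illustration; the point is that the old "mask" covered the whole reader nesting-count field rather than the single bit its name suggests.
```
#define RCUTORTURE_RDR_SHIFT	8	/* illustrative value */

/* Before: all bits below the shift, i.e. the reader nesting count. */
#define RCUTORTURE_RDR_MASK_OLD	((1 << RCUTORTURE_RDR_SHIFT) - 1)

/* After: exactly the one bit indicated by the shift. */
#define RCUTORTURE_RDR_MASK_NEW	(1 << RCUTORTURE_RDR_SHIFT)
```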
2021-11-30rcu-tasks: Don't remove tasks with pending IPIs from holdout listPaul E. McKenney
Currently, the check_all_holdout_tasks_trace() function removes all tasks marked with ->trc_reader_checked from the holdout list, including those with IPIs pending. This means that the IPI handler might arrive at a task that has already been removed from the list, which is at best an accident waiting to happen. This commit therefore avoids removing tasks with IPIs pending from the holdout list. This in turn means that the "if" condition in the for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always evaluate to false, so a WARN_ON_ONCE() is added to check that. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30srcu: Prevent redundant __srcu_read_unlock() wakeupPaul E. McKenney
Tiny SRCU readers can appear at task level, but also in interrupt and softirq handlers. Because Tiny SRCU is selected only in kernels built with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace period to start while there is a non-task-level SRCU reader executing. This means that it does not make sense for __srcu_read_unlock() to awaken the Tiny SRCU grace period, because that can only happen when the grace period is waiting for one value of ->srcu_idx and __srcu_read_unlock() is ending the last reader for some other value of ->srcu_idx. After all, any such wakeup will be redundant. Worse yet, in some cases, such wakeups generate lockdep splats: ====================================================== WARNING: possible circular locking dependency detected 5.15.0-rc1+ #3758 Not tainted ------------------------------------------------------ rcu_torture_rea/53 is trying to acquire lock: ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at: xa/0x30 but task is already holding lock: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at: _extend+0x370/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&p->pi_lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x2f/0x50 try_to_wake_up+0x50/0x580 swake_up_locked.part.7+0xe/0x30 swake_up_one+0x22/0x30 rcutorture_one_extend+0x1b6/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_timer+0x1a/0x70 call_timer_fn+0xa6/0x230 run_timer_softirq+0x493/0x4c0 __do_softirq+0xc0/0x371 irq_exit+0x73/0x90 sysvec_apic_timer_interrupt+0x63/0x80 asm_sysvec_apic_timer_interrupt+0x12/0x20 default_idle+0xb/0x10 default_idle_call+0x5e/0x170 do_idle+0x18a/0x1f0 cpu_startup_entry+0xa/0x10 start_kernel+0x678/0x69f secondary_startup_64_no_verify+0xc2/0xcb -> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}: __lock_acquire+0x130c/0x2440 lock_acquire+0xc2/0x270 _raw_spin_lock_irqsave+0x2f/0x50 swake_up_one+0xa/0x30 rcutorture_one_extend+0x387/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_reader+0xac/0x200 kthread+0x12d/0x150 ret_from_fork+0x22/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&p->pi_lock); lock(srcu_ctl.srcu_wq.lock); lock(&p->pi_lock); lock(srcu_ctl.srcu_wq.lock); *** DEADLOCK *** 1 lock held by rcu_torture_rea/53: #0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at: _extend+0x370/0x400 stack backtrace: CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+ Hardware name: Red Hat KVM/RHEL-AV, BIOS e_el8.5.0+746+bbd5d70c 04/01/2014 Call Trace: check_noncircular+0xfe/0x110 ? find_held_lock+0x2d/0x90 __lock_acquire+0x130c/0x2440 lock_acquire+0xc2/0x270 ? swake_up_one+0xa/0x30 ? find_held_lock+0x72/0x90 _raw_spin_lock_irqsave+0x2f/0x50 ? swake_up_one+0xa/0x30 swake_up_one+0xa/0x30 rcutorture_one_extend+0x387/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_reader+0xac/0x200 ? rcutorture_oom_notify+0xf0/0xf0 ? __kthread_parkme+0x61/0x90 ? rcu_torture_one_read+0x5d0/0x5d0 kthread+0x12d/0x150 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 This is a false positive because there is only one CPU, and both locks are raw (non-preemptible) spinlocks. However, it is worthwhile getting rid of the redundant wakeup, which has the side effect of breaking the theoretical deadlock cycle. This commit therefore eliminates the redundant wakeups. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
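A heavily hedged sketch of the idea (field names follow Tiny SRCU, but the exact condition in the patch may differ): the wakeup is simply skipped when the reader is not ending at task level.
```
/* Sketch only: an interrupt- or softirq-level reader can never be the one
 * the grace period is waiting on, so only task-level readers bother waking
 * up grace-period processing.
 */
void __srcu_read_unlock_sketch(struct srcu_struct *ssp, int idx)
{
	int newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) - 1;

	WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
	if (!newval && READ_ONCE(ssp->srcu_gp_waiting) && in_task())
		swake_up_one(&ssp->srcu_wq);
}
```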
2021-11-30tools/nolibc: Implement gettid()Mark Brown
Allow test programs to determine their thread ID. Signed-off-by: Mark Brown <broonie@kernel.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30tools/nolibc: x86-64: Use `mov $60,%eax` instead of `mov $60,%rax`Ammar Faizi
Note that a mov to a 32-bit register zero-extends into the corresponding 64-bit register. Thus `mov $60,%eax` has the same effect as `mov $60,%rax`. Use the shorter encoding to achieve the same thing. ``` b8 3c 00 00 00 mov $60,%eax (5 bytes) [1] 48 c7 c0 3c 00 00 00 mov $60,%rax (7 bytes) [2] ``` Currently, we use [2]. Change it to [1] for shorter code. Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
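For reference, a hedged sketch of what the shorter form looks like at the source level in a nolibc-style exit path (illustrative, not the literal nolibc code):
```
/* Illustrative only: the 32-bit move zero-extends into %rax, so the
 * 5-byte "movl" encoding is sufficient to load the exit syscall number.
 */
static __attribute__((noreturn)) void exit_sketch(int status)
{
	__asm__ volatile ("movl $60, %%eax\n\t"	/* b8 3c 00 00 00 (5 bytes) */
			  "syscall"
			  : : "D" (status)	/* exit status in %edi */
			  : "rax", "rcx", "r11", "memory");
	__builtin_unreachable();
}
```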
2021-11-30tools/nolibc: x86: Remove `r8`, `r9` and `r10` from the clobber listAmmar Faizi
The Linux x86-64 syscall only clobbers rax, rcx and r11 (and "memory"). - rax for the return value. - rcx to save the return address. - r11 to save the rflags. Other registers are preserved. Having r8, r9 and r10 in the syscall clobber list is harmless, but it results in a missed optimization. As the syscall doesn't clobber r8-r10, GCC should be allowed to reuse their values after the syscall returns to userspace. But since they are in the clobber list, GCC will always miss this opportunity. Remove them from the x86-64 syscall clobber list to help GCC generate better code and fix the comment. See also the x86-64 ABI, section A.2 AMD64 Linux Kernel Conventions, A.2.1 Calling Conventions [1]. Extra note: Some people may think there is no real benefit in removing r8, r9 and r10 from the syscall clobber list because they view a syscall as a C function call, and a function call always clobbers those three registers. However, that is not the case for nolibc.h, because the "syscall" instruction (whose opcode is "0f 05") can be inlined into the user functions. All syscalls in nolibc.h are written as static functions with inline ASM and are likely always inlined when optimization flags are used, so it pays not to have r8, r9 and r10 in the clobber list. Here is an example where this matters. Consider the following C code: ``` #include "tools/include/nolibc/nolibc.h" #define read_abc(a, b, c) __asm__ volatile("nop"::"r"(a),"r"(b),"r"(c)) int main(void) { int a = 0xaa; int b = 0xbb; int c = 0xcc; read_abc(a, b, c); write(1, "test\n", 5); read_abc(a, b, c); return 0; } ``` Compile with: gcc -Os test.c -o test -nostdlib With r8, r9, r10 in the clobber list, GCC generates this: 0000000000001000 <main>: 1000: f3 0f 1e fa endbr64 1004: 41 54 push %r12 1006: 41 bc cc 00 00 00 mov $0xcc,%r12d 100c: 55 push %rbp 100d: bd bb 00 00 00 mov $0xbb,%ebp 1012: 53 push %rbx 1013: bb aa 00 00 00 mov $0xaa,%ebx 1018: 90 nop 1019: b8 01 00 00 00 mov $0x1,%eax 101e: bf 01 00 00 00 mov $0x1,%edi 1023: ba 05 00 00 00 mov $0x5,%edx 1028: 48 8d 35 d1 0f 00 00 lea 0xfd1(%rip),%rsi 102f: 0f 05 syscall 1031: 90 nop 1032: 31 c0 xor %eax,%eax 1034: 5b pop %rbx 1035: 5d pop %rbp 1036: 41 5c pop %r12 1038: c3 ret GCC thinks that syscall will clobber r8, r9, r10. So it spills 0xaa, 0xbb and 0xcc to callee-saved registers (r12, rbp and rbx). This clearly costs extra memory accesses and extra stack space to preserve them. But syscall does not actually clobber them, so this is a missed optimization. Now without r8, r9, r10 in the clobber list, GCC generates better code: 0000000000001000 <main>: 1000: f3 0f 1e fa endbr64 1004: 41 b8 aa 00 00 00 mov $0xaa,%r8d 100a: 41 b9 bb 00 00 00 mov $0xbb,%r9d 1010: 41 ba cc 00 00 00 mov $0xcc,%r10d 1016: 90 nop 1017: b8 01 00 00 00 mov $0x1,%eax 101c: bf 01 00 00 00 mov $0x1,%edi 1021: ba 05 00 00 00 mov $0x5,%edx 1026: 48 8d 35 d3 0f 00 00 lea 0xfd3(%rip),%rsi 102d: 0f 05 syscall 102f: 90 nop 1030: 31 c0 xor %eax,%eax 1032: c3 ret Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Laight <David.Laight@ACULAB.COM> Acked-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1] Link: https://lore.kernel.org/lkml/20211011040344.437264-1-ammar.faizi@students.amikom.ac.id/ Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30tools/nolibc: fix incorrect truncation of exit codeWilly Tarreau
Ammar Faizi reported that our exit code handling is wrong. We truncate it to the lowest 8 bits but the syscall itself is expected to take a regular 32-bit signed integer, not an unsigned char. It's the kernel that later truncates it to the lowest 8 bits. The difference is visible in strace, where the program below used to show exit(255) instead of exit(-1): int main(void) { return -1; } This patch applies the fix to all archs. x86_64, i386, arm64, armv7 and mips were all tested and confirmed to work fine now. RISC-V was not tested but the change is trivial and exactly the same as for other archs. Reported-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
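A minimal sketch of the fix's idea, assuming an x86-64 wrapper (the actual patch touches each architecture's exit path in nolibc.h; the function name is hypothetical): the status is passed through as a signed int and the kernel does the truncation.
```
static __attribute__((noreturn)) void exit_sketch_fixed(int status)
{
	/* Before the fix (illustrative): the status was pre-masked in userspace,
	 * e.g. something like sys_exit(status & 255), so strace saw exit(255).
	 * After the fix: pass the signed int through and let the kernel truncate.
	 */
	__asm__ volatile ("movl $60, %%eax\n\t"		/* __NR_exit */
			  "syscall"
			  : : "D" ((long)status)	/* no "& 255" here */
			  : "rax", "rcx", "r11", "memory");
	__builtin_unreachable();
}
```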
2021-11-30tools/nolibc: i386: fix initial stack alignmentWilly Tarreau
After re-checking the spec and comparing stack offsets with glibc, it turns out that the last pushed argument must be 16-byte aligned (i.e. %esp must be aligned before the call) so that, in the callee, %esp+4 is a multiple of 16; the principle is the 32-bit equivalent of what Ammar fixed for x86_64. It's possible that 32-bit code using SSE2 or MMX could have been affected. In addition, the frame pointer ought to be zero at the deepest level. Link: https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI Cc: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30tools/nolibc: x86-64: Fix startup code bugAmmar Faizi
Before this patch, the `_start` function looks like this: ``` 0000000000001170 <_start>: 1170: pop %rdi 1171: mov %rsp,%rsi 1174: lea 0x8(%rsi,%rdi,8),%rdx 1179: and $0xfffffffffffffff0,%rsp 117d: sub $0x8,%rsp 1181: call 1000 <main> 1186: movzbq %al,%rdi 118a: mov $0x3c,%rax 1191: syscall 1193: hlt 1194: data16 cs nopw 0x0(%rax,%rax,1) 119f: nop ``` Note the "and" of %rsp with $-16: it makes %rsp 16-byte aligned, but then there is a "sub" of $0x8 which makes %rsp no longer 16-byte aligned before calling main. That's the bug! What the x86-64 System V ABI actually mandates is that %rsp must be 16-byte aligned right before the "call", not after it. So the "sub" with $0x8 here breaks the alignment. Remove it. An example where this rule matters is when the callee needs its stack 16-byte aligned for aligned move instructions such as `movdqa` and `movaps`. If the callee can't align its stack properly, the result is a segmentation fault. The x86-64 System V ABI also mandates that the frame pointer in the deepest stack frame be zero. Just to be safe, zero %rbp on startup, as its content may be unspecified when the program starts. Now it looks like this: ``` 0000000000001170 <_start>: 1170: pop %rdi 1171: mov %rsp,%rsi 1174: lea 0x8(%rsi,%rdi,8),%rdx 1179: xor %ebp,%ebp # zero the %rbp 117b: and $0xfffffffffffffff0,%rsp # align the %rsp 117f: call 1000 <main> 1184: movzbq %al,%rdi 1188: mov $0x3c,%rax 118f: syscall 1191: hlt 1192: data16 cs nopw 0x0(%rax,%rax,1) 119d: nopl (%rax) ``` Cc: Bedirhan KURT <windowz414@gnuweeb.org> Cc: Louvian Lyndal <louvianlyndal@gmail.com> Reported-by: Peter Cordes <peter@cordes.ca> Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> [wt: I did this on purpose due to a misunderstanding of the spec, other archs will thus have to be rechecked, particularly i386] Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
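At the source level the fix boils down to two changed instructions in the startup stub. A hedged sketch follows (the real nolibc _start carries more setup; the symbol name is hypothetical and the mnemonics mirror the disassembly above):
```
/* Illustrative startup stub: align %rsp to 16 bytes right before the call
 * (no extra "sub $8"), and zero %rbp so the deepest frame terminates
 * stack walks.
 */
__asm__ (
	".global _start_sketch\n"
	"_start_sketch:\n"
	"	pop %rdi\n"			/* argc */
	"	mov %rsp, %rsi\n"		/* argv */
	"	lea 8(%rsi,%rdi,8), %rdx\n"	/* envp = argv + argc + 1 */
	"	xor %ebp, %ebp\n"		/* deepest frame: %rbp == 0 */
	"	and $-16, %rsp\n"		/* 16-byte aligned before the call */
	"	call main\n"
	"	movzbq %al, %rdi\n"
	"	mov $60, %rax\n"		/* __NR_exit */
	"	syscall\n"
	"	hlt\n"
);
```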
2021-11-30rcu: Avoid alloc_pages() when recording stackJun Miao
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however, it is not possible to use either GFP_ATOMIC or GFP_NOWAIT in certain non-preemptible contexts, including under raw_spin_locks on RT kernels (see gfp.h and commit ab00db216c9c7). Fix this by using kasan_record_aux_stack_noalloc(), which instructs stackdepot not to expand stack storage via alloc_pages() when it runs out. Jianwei Hu reported: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3 INFO: lockdep is turned off. irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590 softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1 Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018 Call Trace: show_stack+0x52/0x58 dump_stack+0xa1/0xd6 ___might_sleep.cold+0x11c/0x12d rt_spin_lock+0x3f/0xc0 rmqueue+0x100/0x1460 rmqueue+0x100/0x1460 mark_usage+0x1a0/0x1a0 ftrace_graph_ret_addr+0x2a/0xb0 rmqueue_pcplist.constprop.0+0x6a0/0x6a0 __kasan_check_read+0x11/0x20 __zone_watermark_ok+0x114/0x270 get_page_from_freelist+0x148/0x630 is_module_text_address+0x32/0xa0 __alloc_pages_nodemask+0x2f6/0x790 __alloc_pages_slowpath.constprop.0+0x12d0/0x12d0 create_prof_cpu_mask+0x30/0x30 alloc_pages_current+0xb1/0x150 stack_depot_save+0x39f/0x490 kasan_save_stack+0x42/0x50 kasan_save_stack+0x23/0x50 kasan_record_aux_stack+0xa9/0xc0 __call_rcu+0xff/0x9c0 call_rcu+0xe/0x10 put_object+0x53/0x70 __delete_object+0x7b/0x90 kmemleak_free+0x46/0x70 slab_free_freelist_hook+0xb4/0x160 kfree+0xe5/0x420 kfree_const+0x17/0x30 kobject_cleanup+0xaa/0x230 kobject_put+0x76/0x90 netdev_queue_update_kobjects+0x17d/0x1f0 ... ... ksys_write+0xd9/0x180 __x64_sys_write+0x42/0x50 do_syscall_64+0x38/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642 Fixes: 84109ab58590 ("rcu: Record kvfree_call_rcu() call stack for KASAN") Fixes: 26e760c9a7c8 ("rcu: kasan: record and print call_rcu() call stack") Reported-by: Jianwei Hu <jianwei.hu@windriver.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Marco Elver <elver@google.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Jun Miao <jun.miao@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcu: Avoid running boost kthreads on isolated CPUsZqiang
When the boost kthreads are created on systems with nohz_full CPUs, the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD). However, when the rcu_boost_kthread_setaffinity() is called, the original affinity will be changed and these kthreads can subsequently run on nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity() restrict these boost kthreads to housekeeping CPUs. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
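The shape of the restriction is a single mask intersection. Below is a sketch under the assumption that the affinity mask has already been built; the function name and fallback behavior are illustrative, not the verbatim patch.
```
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>	/* housekeeping_cpumask() */

/* Illustrative: never leave nohz_full CPUs in the boost kthread's mask. */
static void restrict_boost_affinity_sketch(struct cpumask *cm)
{
	cpumask_and(cm, cm, housekeeping_cpumask(HK_FLAG_KTHREAD));
	if (cpumask_empty(cm))	/* assumed fallback if the intersection is empty */
		cpumask_copy(cm, housekeeping_cpumask(HK_FLAG_KTHREAD));
}
```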
2021-11-30rcu: Improve tree_plugin.h comments and add code cleanupsZhouyi Zhou
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcu: in_irq() cleanupChangbin Du
This commit replaces the obsolete and ambiguous macro in_irq() with its shiny new in_hardirq() equivalent. Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcu: Replace ________p1 and _________p1 with __UNIQUE_ID(rcu)Chun-Hung Tseng
This commit replaces both ________p1 and _________p1 with __UNIQUE_ID(rcu), and also adjusts the callers of the affected macros. __UNIQUE_ID(rcu) will generate unique variable names during compilation, which eliminates the need for ________p1 and _________p1 (both having 4 occurrences prior to the code change). This also avoids the variable name shadowing issue, or at least makes those wishing to cause shadowing problems work much harder to do so. The same idea is used for the min/max macros (commits 589a978 and e9092d0). Signed-off-by: Jim Huang <jserv@ccns.ncku.edu.tw> Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
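A toy illustration of the pattern (not the real rcu_dereference() internals): the unique name is generated once at the outer macro and passed down, so every use inside the helper refers to the same identifier, while each expansion site gets a fresh, non-shadowing name. The macro names here are hypothetical.
```
#include <linux/compiler.h>	/* READ_ONCE(); __UNIQUE_ID() comes via the compiler headers */

#define __deref_sketch(p, local)					\
({									\
	typeof(*(p)) *local = READ_ONCE(p);	/* one fresh name per expansion */ \
	(local);							\
})

#define deref_sketch(p)	__deref_sketch((p), __UNIQUE_ID(rcu))
```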
2021-11-30rcu: Move rcu_needs_cpu() to tree.cPaul E. McKenney
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of the rcu_needs_cpu() function. This commit therefore moves this function from kernel/rcu/tree_plugin.h to kernel/rcu/tree.c. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcu: Remove the RCU_FAST_NO_HZ Kconfig optionPaul E. McKenney
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve systems with RCU callbacks offloaded. In this situation, all that this Kconfig option does is slow down idle entry/exit with an additional always-taken early exit. If this is the only use case, then this Kconfig option is nothing but an attractive nuisance that needs to go away. This commit therefore removes the RCU_FAST_NO_HZ Kconfig option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Remove RCU_FAST_NO_HZ from rcu scenariosPaul E. McKenney
All of the rcu scenarios that mention CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30torture: Remove RCU_FAST_NO_HZ from rcuscale and refscale scenariosPaul E. McKenney
All of the rcuscale and refscale scenarios that mention the Kconfig option CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30doc: RCU: Avoid 'Symbol' font-family in SVG figuresAkira Yokosawa
On Ubuntu Focal, strings in some of SVG files under Documentation/RCU/Design can not be rendered properly when converted to PDF. Ubuntu releases since Focal and Debian bullseye have trouble with "Symbol" font-family in SVG files. As those strings are mostly API names such as "READ_ONCE()", "WRITE_ONCE(), "rcu_read_lock()", and so on, using a generic monospace font-family should be a good alternative. Substitute the font-family name by a simple sed pattern: 's/Symbol/monospace/g' Signed-off-by: Akira Yokosawa <akiyks@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30doc: Add refcount analogy to What is RCUNeilBrown
The reader-writer-lock analogy is a useful way to think about RCU, but it is not always applicable. It is useful to have other analogies to work with, and particularly to emphasise that no single analogy is perfect. This patch adds an "RCU as reference count" analogy to the "What is RCU?" document. See https://lwn.net/Articles/872559/ [ paulmck: Apply Akira Yokosawa feedback. ] Reviewed-by: Akira Yokosawa <akiyks@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30doc: Remove obsolete kernel-per-CPU-kthreads RCU_FAST_NO_HZ advicePaul E. McKenney
This document advises building with both CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y. However, nohz_full CPUs have their RCU callbacks offloaded, and CPUs with offloaded callbacks do not benefit from CONFIG_RCU_FAST_NO_HZ=y. Quite the opposite: CONFIG_RCU_FAST_NO_HZ=y simply adds a bit of idle entry/exit overhead. This commit therefore changes that advice to recommend only CONFIG_NO_HZ=y. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenariosPaul E. McKenney
With CONFIG_PREEMPT_DYNAMIC=y, the kernel builds with CONFIG_PREEMPTION=y because preemption can be enabled at runtime. This prevents any tests of Tiny RCU or Tiny SRCU from running correctly. This commit therefore explicitly sets CONFIG_PREEMPT_DYNAMIC=n for those scenarios. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-14Linux 5.16-rc1v5.16-rc1Linus Torvalds
2021-11-14kconfig: Add support for -Wimplicit-fallthroughGustavo A. R. Silva
Add Kconfig support for -Wimplicit-fallthrough for both GCC and Clang. The compiler option is under configuration CC_IMPLICIT_FALLTHROUGH, which is enabled by default. Special thanks to Nathan Chancellor who fixed the Clang bug[1][2]. This bugfix only appears in Clang 14.0.0, so older versions still contain the bug and -Wimplicit-fallthrough won't be enabled for them, for now. This concludes a long journey and now we are finally getting rid of the unintentional fallthrough bug-class in the kernel, entirely. :) Link: https://github.com/llvm/llvm-project/commit/9ed4a94d6451046a51ef393cd62f00710820a7e8 [1] Link: https://bugs.llvm.org/show_bug.cgi?id=51094 [2] Link: https://github.com/KSPP/linux/issues/115 Link: https://github.com/ClangBuiltLinux/linux/issues/236 Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
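For illustration, here is the class of bug the warning catches and the kernel idiom used to annotate intentional cases (a generic example, not taken from the patch):
```
#include <linux/compiler_attributes.h>	/* fallthrough */

static int classify_sketch(int v)
{
	int flags = 0;

	switch (v) {
	case 0:
		flags |= 1;
		fallthrough;	/* intentional: also apply the case-1 action */
	case 1:
		flags |= 2;
		break;		/* forgetting this break now triggers the warning */
	default:
		flags = -1;
		break;
	}
	return flags;
}
```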
2021-11-14Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs cleanups from Darrick Wong: "The most 'exciting' aspect of this branch is that the xfsprogs maintainer and I have worked through the last of the code discrepancies between kernel and userspace libxfs such that there are no code differences between the two except for #includes. IOWs, diff suffices to demonstrate that the userspace tools behave the same as the kernel, and kernel-only bits are clearly marked in the /kernel/ source code instead of just the userspace source. Summary: - Clean up open-coded swap() calls. - A little bit of #ifdef golf to complete the reunification of the kernel and userspace libxfs source code" * tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: sync xfs_btree_split macros with userspace libxfs xfs: #ifdef out perag code for userspace xfs: use swap() to make dabtree code cleaner
2021-11-14Merge tag 'for-5.16/parisc-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull more parisc fixes from Helge Deller: "Fix a build error in stracktrace.c, fix resolving of addresses to function names in backtraces, fix single-stepping in assembly code and flush userspace pte's when using set_pte_at()" * tag 'for-5.16/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc/entry: fix trace test in syscall exit path parisc: Flush kernel data mapping in set_pte_at() when installing pte for user page parisc: Fix implicit declaration of function '__kernel_text_address' parisc: Fix backtrace to always include init funtion names
2021-11-14Merge tag 'sh-for-5.16' of git://git.libc.org/linux-shLinus Torvalds
Pull arch/sh updates from Rich Felker. * tag 'sh-for-5.16' of git://git.libc.org/linux-sh: sh: pgtable-3level: Fix cast to pointer from integer of different size sh: fix READ/WRITE redefinition warnings sh: define __BIG_ENDIAN for math-emu sh: math-emu: drop unused functions sh: fix kconfig unmet dependency warning for FRAME_POINTER sh: Cleanup about SPARSE_IRQ sh: kdump: add some attribute to function maple: fix wrong return value of maple_bus_init(). sh: boot: avoid unneeded rebuilds under arch/sh/boot/compressed/ sh: boot: add intermediate vmlinux.bin* to targets instead of extra-y sh: boards: Fix the cacography in irq.c sh: check return code of request_irq sh: fix trivial misannotations
2021-11-14Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM fixes from Russell King: - Fix early_iounmap - Drop cc-option fallbacks for architecture selection * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9156/1: drop cc-option fallbacks for architecture selection ARM: 9155/1: fix early early_iounmap()
2021-11-14Merge tag 'devicetree-fixes-for-5.16-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Two fixes due to DT node name changes on Arm, Ltd. boards - Treewide rename of Ingenic CGU headers - Update ST email addresses - Remove Netlogic DT bindings - Dropping few more cases of redundant 'maxItems' in schemas - Convert toshiba,tc358767 bridge binding to schema * tag 'devicetree-fixes-for-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: dt-bindings: watchdog: sunxi: fix error in schema bindings: media: venus: Drop redundant maxItems for power-domain-names dt-bindings: Remove Netlogic bindings clk: versatile: clk-icst: Ensure clock names are unique of: Support using 'mask' in making device bus id dt-bindings: treewide: Update @st.com email address to @foss.st.com dt-bindings: media: Update maintainers for st,stm32-hwspinlock.yaml dt-bindings: media: Update maintainers for st,stm32-cec.yaml dt-bindings: mfd: timers: Update maintainers for st,stm32-timers dt-bindings: timer: Update maintainers for st,stm32-timer dt-bindings: i2c: imx: hardware do not restrict clock-frequency to only 100 and 400 kHz dt-bindings: display: bridge: Convert toshiba,tc358767.txt to yaml dt-bindings: Rename Ingenic CGU headers to ingenic,*.h
2021-11-14Merge tag 'timers-urgent-2021-11-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for POSIX CPU timers to address a problem where POSIX CPU timer delivery stops working for a new child task because copy_process() copies state information which is only valid for the parent task" * tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
2021-11-14Merge tag 'irq-urgent-2021-11-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of fixes for the interrupt subsystem Core code: - A regression fix for the Open Firmware interrupt mapping code where an interrupt controller property in a node caused a map property in the same node to be ignored. Interrupt chip drivers: - Work around a limitation in the SiFive PLIC interrupt chip which silently ignores an EOI when the interrupt line is masked. - Provide the missing mask/unmask implementation for the CSKY MP interrupt controller. PCI/MSI: - Prevent a use after free when PCI/MSI interrupts are released by destroying the sysfs entries before freeing the memory which is accessed in the sysfs show() function. - Implement a mask quirk for the Nvidia ION AHCI chip which does not advertise masking capability despite implementing it. Even worse, the chip comes out of reset with all MSI entries masked, which due to the missing masking capability never get unmasked. - Move the check which prevents accessing the MSI[X] masking for XEN back into the low level accessors. The recent consolidation missed that these accessors can be invoked from places which do not have that check, which broke XEN. Move them back to the original place instead of sprinkling tons of these checks all over the code" * tag 'irq-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: of/irq: Don't ignore interrupt-controller when interrupt-map failed irqchip/sifive-plic: Fixup EOI failed when masked irqchip/csky-mpintc: Fixup mask/unmask implementation PCI/MSI: Destroy sysfs before freeing entries PCI: Add MSI masking quirk for Nvidia ION AHCI PCI/MSI: Deal with devices lying about their MSI mask capability PCI/MSI: Move non-mask check back into low level accessors
2021-11-14Merge tag 'locking-urgent-2021-11-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 static call update from Thomas Gleixner: "A single fix for static calls to make the trampoline patching more robust by placing explicit signature bytes after the call trampoline to prevent patching random other jumps like the CFI jump table entries" * tag 'locking-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: static_call,x86: Robustify trampoline patching
2021-11-14Merge tag 'sched_urgent_for_v5.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Avoid touching ~100 config files in order to be able to select the preemption model - clear cluster CPU masks too, on the CPU unplug path - prevent use-after-free in cfs - Prevent a race condition when updating CPU cache domains - Factor out common shared part of smp_prepare_cpus() into a common helper which can be called by both baremetal and Xen, in order to fix a booting of Xen PV guests * tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: preempt: Restore preemption model selection configs arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology() sched/fair: Prevent dead task groups from regaining cfs_rq's sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain() x86/smp: Factor out parts of native_smp_prepare_cpus()
2021-11-14Merge tag 'perf_urgent_for_v5.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Prevent unintentional page sharing by checking whether a page reference to a PMU samples page has been acquired properly before that - Make sure the LBR_SELECT MSR is saved/restored too - Reset the LBR_SELECT MSR when resetting the LBR PMU to clear any residual data left * tag 'perf_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Avoid put_page() when GUP fails perf/x86/vlbr: Add c->flags to vlbr event constraints perf/x86/lbr: Reset LBR_SELECT during vlbr reset