path: root/kernel
Age | Commit message | Author
2020-10-29 | perf,mm: Handle non-page-table-aligned hugetlbfs | Peter Zijlstra
A limited number of architectures support hugetlbfs sizes that do not align with the page tables (ARM64, Power, Sparc64). Add support for this to the generic perf_get_page_size() implementation, and also allow an architecture to override this implementation. The latter is only needed when an architecture uses non-page-table-aligned huge pages in its kernel map.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 | perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE | Stephane Eranian
When studying code layout, it is useful to capture the page size of the sampled code address. Add a new sample type for code page size. The new sample type requires collecting the ip. The code page size can be calculated from the NMI-safe perf_get_page_size(). For large PEBS, it's very unlikely that the mapping is gone for the earlier PEBS records. Enable the feature for the large PEBS. The worst case is that page-size '0' is returned. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
2020-10-29 | perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE | Kan Liang
Current perf can report both virtual addresses and physical addresses, but not the MMU page size. Without the MMU page size information of the utilized page, users cannot decide whether to promote/demote large pages to optimize memory usage.

Add a new sample type for the data MMU page size.

Current perf already has a facility to collect data virtual addresses. A page walker is required to walk the page tables and calculate the MMU page size from a given virtual address.

On some platforms, e.g., X86, the page walker is invoked in an NMI handler. So the page walker must be NMI-safe and low overhead. Besides, the page walker should work for both user and kernel virtual addresses. The existing generic page walker, e.g., walk_page_range_novma(), is a little bit complex and doesn't guarantee NMI-safety. follow_page() only works for user virtual addresses.

Add a new function perf_get_page_size() to walk the page tables and calculate the MMU page size. In the function:
- Interrupts have to be disabled to prevent any teardown of the page tables.
- For user space threads, current->mm is used for the page walker. For kernel threads and the like, current->mm is NULL, so init_mm is used for the page walker. The active_mm is not used here, because it can be NULL. Quote from Peter Zijlstra, "context_switch() can set prev->active_mm to NULL when it transfers it to @next. It does this before @current is updated. So an NMI that comes in between this active_mm swizzling and updating @current will see !active_mm."
- The MMU page size is calculated from the page table level.

The method should work for all architectures, but it has only been verified on X86. Should there be architectures which support perf but where the method doesn't work, it can be fixed later separately. Reporting the wrong page size would not be fatal for the architecture.

Some features under discussion may impact the method in the future. Quote from Dave Hansen, "There are lots of weird things folks are trying to do with the page tables, like Address Space Isolation. For instance, if you get a perf NMI when running userspace, current->mm->pgd is *different* than the PGD that was in use when userspace was running. It's close enough today, but it might not stay that way." If that case happens later, lots of consecutive page walk errors will happen. The worst case is that lots of page-size '0' samples are returned, which would not be fatal. In the perf tool, a check is implemented to detect this case. Once it happens, a kernel patch could be implemented accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com
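As a rough illustration of the last point above (deriving the MMU page size from the level at which the walk terminates), here is a minimal standalone sketch. It models the level-to-size mapping a 4-level x86-64 walk would use; the function name and values are illustrative assumptions made for this log, not the kernel's perf_get_page_size() itself.

    /* Illustrative userspace model only: shows how an MMU page size can be
     * derived from the page-table level at which a 4-level x86-64 walk
     * terminates. The real walker runs with interrupts disabled and uses
     * current->mm or init_mm as described above. */
    #include <stdio.h>

    static unsigned long long page_size_from_level(int level)
    {
            switch (level) {
            case 1: return 4096ULL;         /* PTE level: 4 KiB */
            case 2: return 2ULL << 20;      /* PMD level: 2 MiB */
            case 3: return 1ULL << 30;      /* PUD level: 1 GiB */
            default: return 0;              /* walk failed: report page size 0 */
            }
    }

    int main(void)
    {
            for (int level = 0; level <= 3; level++)
                    printf("level %d -> %llu bytes\n",
                           level, page_size_from_level(level));
            return 0;
    }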
2020-10-29 | sched/fair: Check for idle core in wake_affine | Julia Lawall
In the case of a thread wakeup, wake_affine determines whether a core will be chosen for the thread on the socket where the thread ran previously or on the socket of the waker. This is done primarily by comparing the load of the core where the thread ran previously (prev) and the load of the waker (this).

commit 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") changed the load computation from the runnable load to the load average, where the latter includes the load of threads that have already blocked on the core. When a short-running daemon process happens to run on prev, this change raised the situation that prev could appear to have a greater load than this, even when prev is actually idle. When prev and this are on the same socket, the idle prev is detected later, in select_idle_sibling. But if that does not hold, prev is completely ignored, causing the waking thread to move to the socket of the waker. In the case of N mostly active threads on N cores, this triggers other migrations and hurts performance. In contrast, before commit 11f10e5420f6, the load on an idle core was 0, and in the case of a non-idle waker core, the effect of wake_affine was to select prev as the target for searching for a core for the waking thread.

To avoid unnecessary migrations, extend wake_affine_idle to check whether the core where the thread previously ran is currently idle, and if so simply return that core as the target.

[1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable load in wakeup path")

This particularly has an impact when using the ondemand power manager, where kworkers run every 0.004 seconds on all cores, increasing the likelihood that an idle core will be considered to have a load.

The following numbers were obtained with the benchmarking tool hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel benchmarks (https://www.nas.nasa.gov/publications/npb.html). The tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power management were used. Times are in seconds. All experiments use all 160 hardware threads.

        v5.9/intel-pstate       v5.9+patch/intel-pstate
bt.C.c  24.725724+-0.962340     23.349608+-1.607214
lu.C.x  29.105952+-4.804203     25.249052+-5.561617
sp.C.x  31.220696+-1.831335     30.227760+-2.429792
ua.C.x  26.606118+-1.767384     25.778367+-1.263850

        v5.9/ondemand           v5.9+patch/ondemand
bt.C.c  25.330360+-1.028316     23.544036+-1.020189
lu.C.x  35.872659+-4.872090     23.719295+-3.883848
sp.C.x  32.141310+-2.289541     29.125363+-0.872300
ua.C.x  29.024597+-1.667049     25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks there is no impact on performance. This also has a major impact on the splash2x.volrend benchmark of the parsec benchmark suite that goes from 1m25 without this patch to 0m45, in active (intel_pstate) mode.

Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path")
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
2020-10-29 | sched: Remove reliance on STRUCT_ALIGNMENT | Peter Zijlstra
Florian reported that all of kernel/sched/ is rebuilt when CONFIG_BLK_DEV_INITRD is changed, which, while not a bug, is unexpected. This is due to us including vmlinux.lds.h. Jakub explained that the problem is that we put the alignment requirement on the type instead of on a variable. Type alignment is a minimum; the compiler is free to pick any larger alignment for a specific instance of the type (e.g. the variable). So force the type alignment on all individual variable definitions and remove the undesired dependency on vmlinux.lds.h.
Fixes: 85c2ce9104eb ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 | sched: Reenable interrupts in do_sched_yield() | Thomas Gleixner
do_sched_yield() invokes schedule() with interrupts disabled which is not allowed. This goes back to the pre git era to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree. Reenable interrupts and remove the misleading comment which "explains" it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
2020-10-29 | sched: membarrier: document memory ordering scenarios | Mathieu Desnoyers
Document membarrier ordering scenarios in membarrier.c. Thanks to Alan Stern for refreshing my memory. Now that I have those in mind, it seems appropriate to serialize them to comments for posterity. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
2020-10-29 | sched: membarrier: cover kthread_use_mm (v4) | Mathieu Desnoyers
Add comments and a memory barrier to kthread_use_mm and kthread_unuse_mm to allow the effect of membarrier(2) to apply to kthreads accessing user-space memory as well. Given that no prior kthread uses this guarantee and that it only affects kthreads, adding this guarantee does not affect user-space ABI. Refine the check in membarrier_global_expedited to exclude runqueues running the idle thread rather than all kthreads from the IPI cpumask. Now that membarrier_global_expedited can IPI kthreads, the scheduler also needs to update the runqueue's membarrier_state when entering lazy TLB state.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
2020-10-29 | sched: fix exit_mm vs membarrier (v4) | Mathieu Desnoyers
exit_mm should issue memory barriers after user-space memory accesses, before clearing current->mm, to order user-space memory accesses performed prior to exit_mm before clearing tsk->mm, which has the effect of skipping the membarrier private expedited IPIs. exit_mm should also update the runqueue's membarrier_state so membarrier global expedited IPIs are not sent when they are not needed.

The membarrier system call can be issued concurrently with do_exit if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind: two thread groups are created, A and B. Thread group B is created by issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD. Let's assume we have a single thread within each thread group (Thread A and Thread B). AFAIU we can have:

Userspace variables: int x = 0, y = 0;

   CPU 0                          CPU 1
   Thread A                       Thread B
   (in thread group A)            (in thread group B)

   x = 1
   barrier()
   y = 1
   exit()
   exit_mm()
     current->mm = NULL;
                                  r1 = load y
                                  membarrier()
                                    skips CPU 0 (no IPI) because its
                                    current mm is NULL
                                  r2 = load x
                                  BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
2020-10-29 | sched/fair: Exclude the current CPU from find_new_ilb() | Peter Zijlstra
It is possible for find_new_ilb() to select the current CPU, however, this only happens from newidle balancing, in which case need_resched() will be true, and consequently nohz_csd_func() will not trigger the softirq. Exclude the current CPU from becoming an ILB target. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 | sched/cpupri: Add CPUPRI_HIGHER | Peter Zijlstra
Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use by higher priority tasks (specifically deadline). XXX: we should probably drive PUSH-PULL from cpupri, that would automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29 | sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1 | Peter Zijlstra
This makes the mapping continuous and frees up 100 for other usage.

Prev mapping:

   p->rt_priority   p->prio   newpri   cpupri
                        -1       -1 (CPUPRI_INVALID)
                       100        0 (CPUPRI_NORMAL)
                1        98       98        1
              ...
               49        50       50       49
               50        49       49       50
              ...
               99         0        0       99

New mapping:

   p->rt_priority   p->prio   newpri   cpupri
                        -1       -1 (CPUPRI_INVALID)
                        99        0 (CPUPRI_NORMAL)
                1        98       98        1
              ...
               49        50       50       49
               50        49       49       50
              ...
               99         0        0       99

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
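A minimal standalone sketch of the kind of priority-to-cpupri conversion the new mapping above describes. The helper below is an illustrative model written for this log, not the kernel's convert_prio(); the constants and the edge handling are assumptions taken from the table.

    /* Model of the continuous mapping shown above:
     * newpri 99 (a non-RT task) -> cpupri 0 (CPUPRI_NORMAL),
     * newpri 98..0 (RT1..RT99)  -> cpupri 1..99. */
    #include <stdio.h>

    #define MAX_RT_PRIO     100
    #define CPUPRI_INVALID  -1
    #define CPUPRI_NORMAL   0

    static int model_convert_prio(int newpri)
    {
            if (newpri < 0)
                    return CPUPRI_INVALID;
            if (newpri == MAX_RT_PRIO - 1)
                    return CPUPRI_NORMAL;
            return MAX_RT_PRIO - 1 - newpri;    /* RT99..RT1 -> 99..1 */
    }

    int main(void)
    {
            int newpri[] = { -1, 99, 98, 50, 49, 0 };

            for (unsigned int i = 0; i < sizeof(newpri) / sizeof(newpri[0]); i++)
                    printf("newpri %3d -> cpupri %3d\n",
                           newpri[i], model_convert_prio(newpri[i]));
            return 0;
    }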
2020-10-29 | sched/cpupri: Remove pri_to_cpu[1] | Dietmar Eggemann
pri_to_cpu[1] isn't used since cpupri_set(..., newpri) is never called with newpri = 99. The valid RT priorities RT1..RT99 (p->rt_priority = [1..99]) map into cpupri (idx of pri_to_cpu[]) = [2..100] Current mapping: p->rt_priority p->prio newpri cpupri -1 -1 (CPUPRI_INVALID) 100 0 (CPUPRI_NORMAL) 1 98 98 2 ... 49 50 50 50 50 49 49 51 ... 99 0 0 100 So cpupri = 1 isn't used. Reduce the size of pri_to_cpu[] by 1 and adapt the cpupri implementation accordingly. This will save a useless for loop with an atomic_read in cpupri_find_fitness() calling __cpupri_find(). New mapping: p->rt_priority p->prio newpri cpupri -1 -1 (CPUPRI_INVALID) 100 0 (CPUPRI_NORMAL) 1 98 98 1 ... 49 50 50 49 50 49 49 50 ... 99 0 0 99 Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200922083934.19275-3-dietmar.eggemann@arm.com
2020-10-29 | sched/cpupri: Remove pri_to_cpu[CPUPRI_IDLE] | Dietmar Eggemann
pri_to_cpu[CPUPRI_IDLE=0] isn't used since cpupri_set(..., newpri) is never called with newpri = MAX_PRIO (140). Current mapping: p->rt_priority p->prio newpri cpupri -1 -1 (CPUPRI_INVALID) 140 0 (CPUPRI_IDLE) 100 1 (CPUPRI_NORMAL) 1 98 98 3 ... 49 50 50 51 50 49 49 52 ... 99 0 0 101 Even when cpupri was introduced with commit 6e0534f27819 ("sched: use a 2-d bitmap for searching lowest-pri CPU") in v2.6.27, only (1) CPUPRI_INVALID (-1), (2) MAX_RT_PRIO (100), (3) an RT prio (RT1..RT99) were used as newprio in cpupri_set(..., newpri) -> convert_prio(newpri). MAX_RT_PRIO is used only in dec_rt_tasks() -> dec_rt_prio() -> dec_rt_prio_smp() -> cpupri_set() in case of !rt_rq->rt_nr_running. I.e. it stands for a non-rt task, including the IDLE task. Commit 57785df5ac53 ("sched: Fix task priority bug") removed code in v2.6.33 which did set the priority of the IDLE task to MAX_PRIO. Although this happened after the introduction of cpupri, it didn't have an effect on the values used for cpupri_set(..., newpri). Remove CPUPRI_IDLE and adapt the cpupri implementation accordingly. This will save a useless for loop with an atomic_read in cpupri_find_fitness() calling __cpupri_find(). New mapping: p->rt_priority p->prio newpri cpupri -1 -1 (CPUPRI_INVALID) 100 0 (CPUPRI_NORMAL) 1 98 98 2 ... 49 50 50 50 50 49 49 51 ... 99 0 0 100 Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200922083934.19275-2-dietmar.eggemann@arm.com
2020-10-29 | sched/deadline: Fix sched_dl_global_validate() | Peng Liu
When changing sched_rt_{runtime, period}_us, we validate that the new settings should at least accommodate the currently allocated -dl bandwidth:

  sched_rt_handler()
    --> sched_dl_bandwidth_validate()
    {
        new_bw = global_rt_runtime()/global_rt_period();

        for_each_possible_cpu(cpu) {
                dl_b = dl_bw_of(cpu);
                if (new_bw < dl_b->total_bw)    <-------
                        ret = -EBUSY;
        }
    }

But under CONFIG_SMP, dl_bw is per root domain, not per CPU, so dl_b->total_bw is the allocated bandwidth of the whole root domain. Instead, we should compare dl_b->total_bw against "cpus*new_bw", where 'cpus' is the number of CPUs of the root domain.

Also, the annotation below (in kernel/sched/sched.h) describes an implementation that only appeared in SCHED_DEADLINE v2[1]; the deadline scheduler kept evolving until it got merged (v9), but the annotation remained unchanged and is now meaningless and misleading, so update it.

 * With respect to SMP, the bandwidth is given on a per-CPU basis,
 * meaning that:
 *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
 *  - dl_total_bw array contains, in the i-eth element, the currently
 *    allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
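A small self-contained sketch of the corrected comparison described above; the structure, field names and numbers below are illustrative assumptions, not the actual sched_dl_global_validate() code.

    /* Model of the per-root-domain check: the root domain as a whole may use
     * up to (number of CPUs) * new_bw, so compare the already allocated
     * total_bw against that product, not against new_bw alone. */
    #include <stdio.h>

    struct root_domain_model {
            int cpus;               /* CPUs in this root domain */
            double total_bw;        /* currently allocated -dl bandwidth */
    };

    static int validate_new_bw(const struct root_domain_model *rd, double new_bw)
    {
            if (new_bw * rd->cpus < rd->total_bw)
                    return -1;      /* -EBUSY: new limit cannot hold the current allocation */
            return 0;
    }

    int main(void)
    {
            struct root_domain_model rd = { .cpus = 4, .total_bw = 1.5 };

            printf("new_bw=0.30: %s\n", validate_new_bw(&rd, 0.30) ? "EBUSY" : "ok");
            printf("new_bw=0.40: %s\n", validate_new_bw(&rd, 0.40) ? "EBUSY" : "ok");
            return 0;
    }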
2020-10-29 | sched/deadline: Optimize sched_dl_global_validate() | Peng Liu
Under CONFIG_SMP, dl_bw is per root domain, not per CPU. When checking or updating dl_bw, iterating over every CPU is overkill; we only need to iterate over each root domain once.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/78d21ee792cc48ff79e8cd62a5f26208463684d6.1602171061.git.iwtbavbm@gmail.com
2020-10-29 | sched/fair: Improve the accuracy of sched_stat_wait statistics | jun qian
When sched_schedstat changes from 0 to 1, some sched entities may already be in the runqueue, so their se->statistics.wait_start will be 0. This makes the computed wait time (rq clock of rq_of(cfs_rq) minus se->statistics.wait_start) wrong. We need to avoid this scenario.
Signed-off-by: jun qian <qianjun.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20201015064846.19809-1-qianjun.kernel@gmail.com
2020-10-29 | task_work: Use TIF_NOTIFY_SIGNAL if available | Jens Axboe
If the arch supports TIF_NOTIFY_SIGNAL, then use that for TWA_SIGNAL as it's more efficient than using the signal delivery method. This is especially true on threaded applications, where ->sighand is shared across threads, but it's also lighter weight on non-shared cases.

io_uring is a heavy consumer of TWA_SIGNAL based task_work. A test with threads shows a nice improvement running an io_uring based echo server.

stock kernel:
0.01% <= 0.1 milliseconds
95.86% <= 0.2 milliseconds
98.27% <= 0.3 milliseconds
99.71% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
100.00% <= 0.7 milliseconds
100.00% <= 0.8 milliseconds
100.00% <= 0.9 milliseconds
100.00% <= 1.0 milliseconds
100.00% <= 1.1 milliseconds
100.00% <= 2 milliseconds
100.00% <= 3 milliseconds
100.00% <= 3 milliseconds
1378930.00 requests per second
~1600% CPU

1.38M requests/second, and all 16 CPUs are maxed out.

patched kernel:
0.01% <= 0.1 milliseconds
98.24% <= 0.2 milliseconds
99.47% <= 0.3 milliseconds
99.99% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
100.00% <= 0.7 milliseconds
100.00% <= 0.8 milliseconds
100.00% <= 0.9 milliseconds
100.00% <= 1.2 milliseconds
1666111.38 requests per second
~1450% CPU

1.67M requests/second, and we're no longer just hammering on the sighand lock.

The original reporter states: "For 5.7.15 my benchmark achieves 1.6M qps and system cpu is at ~80%. for 5.7.16 or later it achieves only 1M qps and the system cpu is at ~100%", with the only difference there being that TWA_SIGNAL is used unconditionally in 5.7.16, since it's required to be able to handle the inability to run task_work if the application is waiting in the kernel already on an event that needs task_work run to be satisfied. Also see commit 0ba9c9edcd15.

Reported-by: Roman Gershman <romger@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-5-axboe@kernel.dk
2020-10-29 | entry: Add support for TIF_NOTIFY_SIGNAL | Jens Axboe
Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set, will return true if signal_pending() is used in a wait loop. That causes an exit of the loop so that notify_signal tracehooks can be run. If the wait loop is currently inside a system call, the system call is restarted once task_work has been processed. In preparation for only having arch_do_signal() handle syscall restarts if _TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart(). Pass in a boolean that tells the architecture specific signal handler if it should attempt to get a signal, or just process a potential syscall restart. For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to get_signal(). This is done to minimize the needed architecture changes to support this feature. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk
2020-10-29 | signal: Add task_sigpending() helper | Jens Axboe
This is in preparation for maintaining signal_pending() as the decider of whether or not a schedule() loop should be broken, or continue sleeping. This is different from the core signal use cases, which really need to know whether an actual signal is pending or not. task_sigpending() returns non-zero if TIF_SIGPENDING is set. Only core kernel use cases should care about the distinction between the two; make sure those use the task_sigpending() helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-2-axboe@kernel.dk
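A tiny standalone sketch of the split described in the two entries above; the flag bit values, struct and helper names are made up for illustration and are not the kernel's implementation.

    /* Model: a task_sigpending()-style check tests only the "real signal"
     * flag, while a signal_pending()-style check also reacts to a "notify"
     * flag used purely to break out of wait loops and run task_work. */
    #include <stdio.h>

    #define TIF_SIGPENDING      (1u << 0)   /* illustrative bit values */
    #define TIF_NOTIFY_SIGNAL   (1u << 1)

    struct task_model { unsigned int flags; };

    static int model_task_sigpending(const struct task_model *t)
    {
            return !!(t->flags & TIF_SIGPENDING);
    }

    static int model_signal_pending(const struct task_model *t)
    {
            return !!(t->flags & (TIF_SIGPENDING | TIF_NOTIFY_SIGNAL));
    }

    int main(void)
    {
            struct task_model t = { .flags = TIF_NOTIFY_SIGNAL };

            /* Wait loops break (signal_pending-style check fires), but no
             * actual signal is pending (task_sigpending-style check is 0). */
            printf("signal_pending=%d task_sigpending=%d\n",
                   model_signal_pending(&t), model_task_sigpending(&t));
            return 0;
    }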
2020-10-28 | bpf: Permit cond_resched for some iterators | Yonghong Song
Commit e679654a704e ("bpf: Fix a rcu_sched stall issue with bpf task/task_file iterator") tries to fix rcu stalls warning which is caused by bpf task_file iterator when running "bpftool prog". rcu: INFO: rcu_sched self-detected stall on CPU rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913 \x09(t=21031 jiffies g=2534773 q=179750) NMI backtrace for cpu 7 CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G W 5.8.0-00004-g68bfc7f8c1b4 #6 Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019 Call Trace: <IRQ> dump_stack+0x57/0x70 nmi_cpu_backtrace.cold+0x14/0x53 ? lapic_can_unplug_cpu.cold+0x39/0x39 nmi_trigger_cpumask_backtrace+0xb7/0xc7 rcu_dump_cpu_stacks+0xa2/0xd0 rcu_sched_clock_irq.cold+0x1ff/0x3d9 ? tick_nohz_handler+0x100/0x100 update_process_times+0x5b/0x90 tick_sched_timer+0x5e/0xf0 __hrtimer_run_queues+0x12a/0x2a0 hrtimer_interrupt+0x10e/0x280 __sysvec_apic_timer_interrupt+0x51/0xe0 asm_call_on_stack+0xf/0x20 </IRQ> sysvec_apic_timer_interrupt+0x6f/0x80 ... task_file_seq_next+0x52/0xa0 bpf_seq_read+0xb9/0x320 vfs_read+0x9d/0x180 ksys_read+0x5f/0xe0 do_syscall_64+0x38/0x60 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The fix is to limit the number of bpf program runs to be one million. This fixed the program in most cases. But we also found under heavy load, which can increase the wallclock time for bpf_seq_read(), the warning may still be possible. For example, calling bpf_delay() in the "while" loop of bpf_seq_read(), which will introduce artificial delay, the warning will show up in my qemu run. static unsigned q; volatile unsigned *p = &q; volatile unsigned long long ll; static void bpf_delay(void) { int i, j; for (i = 0; i < 10000; i++) for (j = 0; j < 10000; j++) ll += *p; } There are two ways to fix this issue. One is to reduce the above one million threshold to say 100,000 and hopefully rcu warning will not show up any more. Another is to introduce a target feature which enables bpf_seq_read() calling cond_resched(). This patch took second approach as the first approach may cause more -EAGAIN failures for read() syscalls. Note that not all bpf_iter targets can permit cond_resched() in bpf_seq_read() as some, e.g., netlink seq iterator, rcu read lock critical section spans through seq_ops->next() -> seq_ops->show() -> seq_ops->next(). For the kernel code with the above hack, "bpftool p" roughly takes 38 seconds to finish on my VM with 184 bpf program runs. Using the following command, I am able to collect the number of context switches: perf stat -e context-switches -- ./bpftool p >& log Without this patch, 69 context-switches With this patch, 75 context-switches This patch added additional 6 context switches, roughly every 6 seconds to reschedule, to avoid lengthy no-rescheduling which may cause the above RCU warnings. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201028061054.1411116-1-yhs@fb.com
2020-10-28 | don't dump the threads that had been already exiting when zapped. | Al Viro
Coredump logics needs to report not only the registers of the dumping thread, but (since 2.5.43) those of other threads getting killed. Doing that might require extra state saved on the stack in asm glue at kernel entry; signal delivery logics does that (we need to be able to save sigcontext there, at the very least) and so does seccomp. That covers all callers of do_coredump(). Secondary threads get hit with SIGKILL and caught as soon as they reach exit_mm(), which normally happens in signal delivery, so those are also fine most of the time. Unfortunately, it is possible to end up with secondary zapped when it has already entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm() when mm->core_state is already set, but the stack contents is not what we would have in signal delivery. At least on two architectures (alpha and m68k) it leads to infoleaks - we end up with a chunk of kernel stack written into coredump, with the contents consisting of normal C stack frames of the call chain leading to exit_mm() instead of the expected copy of userland registers. In case of alpha we leak 312 bytes of stack. Other architectures (including the regset-using ones) might have similar problems - the normal user of regsets is ptrace and the state of tracee at the time of such calls is special in the same way signal delivery is. Note that had the zapper gotten to the exiting thread slightly later, it wouldn't have been included into coredump anyway - we skip the threads that have already cleared their ->mm. So let's pretend that zapper always loses the race. IOW, have exit_mm() only insert into the dumper list if we'd gotten there from handling a fatal signal[*] As the result, the callers of do_exit() that have *not* gone through get_signal() are not seen by coredump logics as secondary threads. Which excludes voluntary exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that, so seccomp is fine. [*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed out that PF_SIGNALED is already doing just what we need. Cc: stable@vger.kernel.org Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-28 | genirq/irqdomain: Implement get_name() method on irqchip fwnodes | David Woodhouse
Prerequisite to make x86 more irqdomain compliant. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20201024213535.443185-23-dwmw2@infradead.org
2020-10-28 | Merge tag 'trace-v5.10-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix synthetic event "strcat" overrun New synthetic event code used strcat() and miscalculated the ending, causing the concatenation to write beyond the allocated memory. Instead of using strncat(), the code is switched over to seq_buf which has all the mechanisms in place to protect against writing more than what is allocated, and cleans up the code a bit" * tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing, synthetic events: Replace buggy strcat() with seq_buf operations
2020-10-28 | futex: Fix incorrect should_fail_futex() handling | Mateusz Nosek
If should_fail_futex() returns true in futex_wake_pi(), then the 'ret' variable is set to -EFAULT and then immediately overwritten. So the failure injection is non-functional. Fix it by actually leaving the function and returning -EFAULT. The Fixes tag is kinda blurry because the initial commit which introduced failure injection was already sloppy, but the below mentioned commit broke it completely.
[ tglx: Massaged changelog ]
Fixes: 6b4f4bc9cb22 ("locking/futex: Allow low-level atomic operations to return -EAGAIN")
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
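A minimal sketch of the control-flow bug and fix described above; the function below is a standalone model, not the kernel's futex_wake_pi(), and the helper name is an assumption.

    /* Model: the injected -EFAULT must actually be returned, not be
     * overwritten by the code that follows it. */
    #include <stdio.h>

    static int should_fail_injection(void)
    {
            return 1;               /* pretend the failure injection fires */
    }

    static int wake_pi_model(void)
    {
            int ret = 0;

            if (should_fail_injection()) {
                    ret = -14;      /* -EFAULT */
                    goto out;       /* the fix: bail out instead of falling through */
            }

            ret = 0;                /* the normal path would overwrite ret here */
    out:
            return ret;
    }

    int main(void)
    {
            printf("wake_pi_model() = %d\n", wake_pi_model());
            return 0;
    }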
2020-10-27 | audit: trigger accompanying records when no rules present | Richard Guy Briggs
When there are no audit rules registered, mandatory records (config, etc.) are missing their accompanying records (syscall, proctitle, etc.). This is due to the audit context dummy flag being set on syscall entry, based on the absence of rules, which signals that no other records are to be printed. Clear the dummy bit if any record is generated, open coding this in audit_log_start().

The proctitle context and dummy checks are pointless since the proctitle record will not be printed if no syscall records are printed.

The fds array is reset to -1 after the first syscall to indicate it isn't valid any more, but was never set to -1 when the context was allocated to indicate it wasn't yet valid.

Check ctx->pwd in audit_log_name().

The audit_inode* functions can be called without going through getname_flags() or getname_kernel() that sets audit_names and cwd, so set the cwd in audit_alloc_name() if it has not already been done due to audit_names being valid, and purge all other audit_getcwd() calls.

Revert the LSM dump_common_audit_data() LSM_AUDIT_DATA_* cases from the ghak96 patch since they are no longer necessary due to cwd coverage in audit_alloc_name().

Thanks to bauen1 <j2468h@googlemail.com> for reporting LSM situations in which context->cwd is not valid, inadvertently fixed by the ghak96 patch.

Please see upstream github issue https://github.com/linux-audit/audit-kernel/issues/120
This is also related to upstream github issue https://github.com/linux-audit/audit-kernel/issues/96

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-10-27 | audit: fix a kernel-doc markup | Mauro Carvalho Chehab
typo: kauditd_print_skb -> kauditd_printk_skb Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-10-27 | PM: sleep: fix typo in kernel/power/process.c | Jackie Zamow
Fix a typo in a comment in freeze_processes(). Signed-off-by: Jackie Zamow <jackie.zamow@gmail.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-27 | tracing, synthetic events: Replace buggy strcat() with seq_buf operations | Steven Rostedt (VMware)
There was a memory corruption bug happening while running the synthetic event selftests:

 kmemleak: Cannot insert 0xffff8c196fa2afe5 into the object search tree (overlaps existing)
 CPU: 5 PID: 6866 Comm: ftracetest Tainted: G W 5.9.0-rc5-test+ #577
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  dump_stack+0x8d/0xc0
  create_object.cold+0x3b/0x60
  slab_post_alloc_hook+0x57/0x510
  ? tracing_map_init+0x178/0x340
  __kmalloc+0x1b1/0x390
  tracing_map_init+0x178/0x340
  event_hist_trigger_func+0x523/0xa40
  trigger_process_regex+0xc5/0x110
  event_trigger_write+0x71/0xd0
  vfs_write+0xca/0x210
  ksys_write+0x70/0xf0
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fef0a63a487
 Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
 RSP: 002b:00007fff76f18398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000039 RCX: 00007fef0a63a487
 RDX: 0000000000000039 RSI: 000055eb3b26d690 RDI: 0000000000000001
 RBP: 000055eb3b26d690 R08: 000000000000000a R09: 0000000000000038
 R10: 000055eb3b2cdb80 R11: 0000000000000246 R12: 0000000000000039
 R13: 00007fef0a70b500 R14: 0000000000000039 R15: 00007fef0a70b700
 kmemleak: Kernel memory leak detector disabled
 kmemleak: Object 0xffff8c196fa2afe0 (size 8):
 kmemleak: comm "ftracetest", pid 6866, jiffies 4295082531
 kmemleak: min_count = 1
 kmemleak: count = 0
 kmemleak: flags = 0x1
 kmemleak: checksum = 0
 kmemleak: backtrace:
      __kmalloc+0x1b1/0x390
      tracing_map_init+0x1be/0x340
      event_hist_trigger_func+0x523/0xa40
      trigger_process_regex+0xc5/0x110
      event_trigger_write+0x71/0xd0
      vfs_write+0xca/0x210
      ksys_write+0x70/0xf0
      do_syscall_64+0x33/0x40
      entry_SYSCALL_64_after_hwframe+0x44/0xa9

The cause came down to a use of strcat() that was appending a string that had been shortened, but the strcat() did not take that into account. strcat() is extremely dangerous as it does not care how big the buffer is. Replace it with seq_buf operations that prevent the buffer from being overwritten if what is being written is bigger than the buffer.

Fixes: 10819e25799a ("tracing: Handle synthetic event array field type checking correctly")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
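A small standalone illustration of the bounded-append idea behind the seq_buf conversion described above; this is a hand-rolled model written for this log, not the kernel's seq_buf API.

    /* Model: an append helper that tracks how much of the buffer is used
     * and refuses to write past the end, unlike strcat(). */
    #include <stdio.h>
    #include <string.h>

    struct bounded_buf {
            char *buf;
            size_t size;
            size_t len;
            int overflow;
    };

    static void bounded_puts(struct bounded_buf *b, const char *s)
    {
            size_t n = strlen(s);

            if (b->len + n + 1 > b->size) {     /* would not fit (incl. NUL) */
                    b->overflow = 1;
                    return;
            }
            memcpy(b->buf + b->len, s, n + 1);
            b->len += n;
    }

    int main(void)
    {
            char storage[16];
            struct bounded_buf b = { .buf = storage, .size = sizeof(storage) };

            bounded_puts(&b, "unsigned int");
            bounded_puts(&b, "[12]");           /* too long: flagged, not written */
            printf("'%s' overflow=%d\n", b.buf, b.overflow);
            return 0;
    }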
2020-10-26 | stop_machine, rcu: Mark functions as notrace | Zong Li
Some architectures assume that the stopped CPUs don't make function calls to traceable functions when they are in the stopped state. See also commit cb9d7fd51d9f ("watchdog: Mark watchdog touch functions as notrace"). Violating this assumption causes kernel crashes when switching tracer on RISC-V. Mark rcu_momentary_dyntick_idle() and stop_machine_yield() notrace to prevent this. Fixes: 4ecf0a43e729 ("processor: get rid of cpu_relax_yield") Fixes: 366237e7b083 ("stop_machine: Provide RCU quiescent state in multi_cpu_stop()") Signed-off-by: Zong Li <zong.li@sifive.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Atish Patra <atish.patra@wdc.com> Tested-by: Colin Ian King <colin.king@canonical.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201021073839.43935-1-zong.li@sifive.com
2020-10-26 | time: Prevent undefined behaviour in timespec64_to_ns() | Zeng Tao
UBSAN reports: Undefined behaviour in ./include/linux/time64.h:127:27 signed integer overflow: 17179869187 * 1000000000 cannot be represented in type 'long long int' Call Trace: timespec64_to_ns include/linux/time64.h:127 [inline] set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180 do_setitimer+0x8e/0x740 kernel/time/itimer.c:245 __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336 do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295 Commit bd40a175769d ("y2038: itimer: change implementation to timespec64") replaced the original conversion which handled time clamping correctly with timespec64_to_ns() which has no overflow protection. Fix it in timespec64_to_ns() as this is not necessarily limited to the usage in itimers. [ tglx: Added comment and adjusted the fixes tag ] Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64") Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com
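A standalone sketch of the overflow and the clamping fix described above. The struct and helper mirror the shape of timespec64_to_ns() but are an illustrative model with assumed names, not the kernel source (the real fix also clamps the negative direction).

    /* Model: converting seconds to nanoseconds overflows s64 once tv_sec
     * exceeds KTIME_SEC_MAX, so saturate at KTIME_MAX before multiplying. */
    #include <stdio.h>
    #include <stdint.h>

    #define NSEC_PER_SEC    1000000000LL
    #define KTIME_MAX       INT64_MAX
    #define KTIME_SEC_MAX   (KTIME_MAX / NSEC_PER_SEC)

    struct timespec64_model { long long tv_sec; long tv_nsec; };

    static long long model_timespec64_to_ns(const struct timespec64_model *ts)
    {
            if (ts->tv_sec >= KTIME_SEC_MAX)    /* would overflow: saturate */
                    return KTIME_MAX;
            return ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
    }

    int main(void)
    {
            /* The tv_sec value from the UBSAN report above. */
            struct timespec64_model big = { .tv_sec = 17179869187LL, .tv_nsec = 0 };

            printf("%lld\n", model_timespec64_to_ns(&big));
            return 0;
    }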
2020-10-26 | timers: Remove unused inline function debug_timer_free() | YueHaibing
There is no caller in tree, remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200909134749.32300-1-yuehaibing@huawei.com
2020-10-26 | hrtimer: Remove unused inline function debug_hrtimer_free() | YueHaibing
There is no caller in tree, remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200909134850.21940-1-yuehaibing@huawei.com
2020-10-26 | time/sched_clock: Mark sched_clock_read_begin/retry() as notrace | Quanyang Wang
Since sched_clock_read_begin() and sched_clock_read_retry() are called by notrace function sched_clock(), they shouldn't be traceable either, or else ftrace_graph_caller will run into a dead loop on the path as below (arm for instance): ftrace_graph_caller() prepare_ftrace_return() function_graph_enter() ftrace_push_return_trace() trace_clock_local() sched_clock() sched_clock_read_begin/retry() Fixes: 1b86abc1c645 ("sched_clock: Expose struct clock_read_data") Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200929082027.16787-1-quanyang.wang@windriver.com
2020-10-26 | timekeeping: Convert jiffies_seq to seqcount_raw_spinlock_t | Davidlohr Bueso
Use the new api and associate the seqcounter to the jiffies_lock enabling lockdep support - although for this particular case the write-side locking and non-preemptibility are quite obvious. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201021190749.19363-1-dave@stgolabs.net
2020-10-25 | treewide: Convert macro and uses of __section(foo) to __section("foo") | Joe Perches
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
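As a tiny illustration of the conversion described above (written for this log, not taken from the patch itself), the quoted form passes the section name as a string literal straight through to the attribute:

    /* Before: __section(.data.boot_params) relied on the macro stringizing
     * its argument; after: the caller quotes the name itself. The macro form
     * below is my understanding of the post-change definition, and the
     * variable/section names are made up. Compiles with both GCC and Clang. */
    #define __section(section) __attribute__((__section__(section)))

    static int boot_mode __section(".data.boot_params") = 1;

    int main(void)
    {
            return boot_mode ? 0 : 1;
    }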
2020-10-25 | kernel/sys.c: fix prototype of prctl_get_tid_address() | Rasmus Villemoes
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in fact a "pointer to (pointer to int in userspace) in userspace". So sparse rightfully complains about passing a kernel pointer to put_user(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 | Merge tag 'timers-urgent-2020-10-25' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "A time namespace fix and a matching selftest. The futex absolute timeouts which are based on CLOCK_MONOTONIC require time namespace correction. This was missed in the original time namespace support"

* tag 'timers-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/timens: Add a test for futex()
  futex: Adjust absolute futex timeouts with per time namespace offset
2020-10-25 | Merge tag 'sched-urgent-2020-10-25' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "Two scheduler fixes:
  - A trivial build fix for sched_feat() to compile correctly with CONFIG_JUMP_LABEL=n
  - Replace a zero-length array with a flexible array"

* tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/features: Fix !CONFIG_JUMP_LABEL case
  sched: Replace zero-length array with flexible-array
2020-10-25 | Merge tag 'safesetid-5.10' of git://github.com/micah-morton/linux | Linus Torvalds
Pull SafeSetID updates from Micah Morton: "The changes are mostly contained to within the SafeSetID LSM, with the exception of a few 1-line changes to change some ns_capable() calls to ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that is examined by SafeSetID code and nothing else in the kernel. The changes to SafeSetID internally allow for setting up GID transition security policies, as already existed for UIDs" * tag 'safesetid-5.10' of git://github.com/micah-morton/linux: LSM: SafeSetID: Fix warnings reported by test bot LSM: SafeSetID: Add GID security policy handling LSM: Signal to SafeSetID when setting group IDs
2020-10-25 | Merge tag '20201024-v4-5.10' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom

Pull random32 updates from Willy Tarreau:
 "Make prandom_u32() less predictable.

  This is the cleanup of the latest series of prandom_u32 experimentations, consisting of using SipHash instead of Tausworthe to produce the randoms used by the network stack.

  The changes to the files were kept minimal, and the controversial commit that used to take noise from the fast_pool (f227e3ec3b5c) was reverted. Instead, a dedicated "net_rand_noise" per_cpu variable is fed from various sources of activities (networking, scheduling) to perturb the SipHash state using fast, non-trivially predictable data, instead of keeping it fully deterministic. The goal is essentially to make any occasional memory leakage or brute-force attempt useless.

  The resulting code was verified to be very slightly faster on x86_64 than what it was with the controversial commit above, though this remains barely above measurement noise. It was also tested on i386 and arm, and build-tested only on arm64"

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/

* tag '20201024-v4-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom:
  random32: add a selftest for the prandom32 code
  random32: add noise from network and scheduling activity
  random32: make prandom_u32() output unpredictable
2020-10-24 | Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping | Linus Torvalds
Pull dma-mapping fixes from Christoph Hellwig: - document the new dma_{alloc,free}_pages() API - two fixups for the dma-mapping.h split * tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: document dma_{alloc,free}_pages dma-mapping: move more functions to dma-map-ops.h ARM/sa1111: add a missing include of dma-map-ops.h
2020-10-24 | random32: add noise from network and scheduling activity | Willy Tarreau
With the removal of the interrupt perturbations in the previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak.

This patch restores the spirit of commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's state using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24 | random32: make prandom_u32() output unpredictable | George Spelvin
Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits. It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops. This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted. Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question. Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it. Reported-by: Amit Klein <aksecurity@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Signed-off-by: George Spelvin <lkml@sdf.org> Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ] Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-23 | Merge tag 'trace-v5.10-3' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing ring-buffer fix from Steven Rostedt: "The success return value of ring_buffer_resize() is stated to be zero and checked that way. But it was incorrectly returning the size allocated. Also, a fix to a comment" * tag 'trace-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Update the description for ring_buffer_wait ring-buffer: Return 0 on success from ring_buffer_resize()
2020-10-23 | Merge tag 'pm-5.10-rc1-2' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "First of all, the adaptive voltage scaling (AVS) drivers go to new platform-specific locations as planned (this part was reported to have merge conflicts against the new arm-soc updates in linux-next). In addition to that, there are some fixes (intel_idle, intel_pstate, RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state accounting support to the generic power domains (genpd) code and some janitorial changes all over. Specifics: - Move the AVS drivers to new platform-specific locations and get rid of the drivers/power/avs directory (Ulf Hansson). - Add on/off notifiers and idle state accounting support to the generic power domains (genpd) framework (Ulf Hansson, Lina Iyer). - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson). - Make intel_idle disregard ACPI _CST if it cannot use the data returned by that method (Mel Gorman). - Modify intel_pstate to avoid leaving useless sysfs directory structure behind if it cannot be registered (Chen Yu). - Fix domain detection in the RAPL power capping driver and prevent it from failing to enumerate the Psys RAPL domain (Zhang Rui). - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and later AMD chips (Wei Huang). - Update the driver assumptions comment in intel_idle and fix a kerneldoc comment in the runtime PM framework (Alexander Monakov, Bean Huo). - Avoid unnecessary resets of the cached frequency in the schedutil cpufreq governor to reduce overhead (Wei Wang). - Clean up the cpufreq core a bit (Viresh Kumar). - Make assorted minor janitorial changes (Daniel Lezcano, Geert Uytterhoeven, Hubert Jasudowicz, Tom Rix). - Clean up and optimize the cpupower utility somewhat (Colin Ian King, Martin Kaistra)" * tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) PM: sleep: remove unreachable break PM: AVS: Drop the avs directory and the corresponding Kconfig PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers PM: runtime: Fix typo in pm_runtime_set_active() helper comment PM: domains: Fix build error for genpd notifiers powercap: Fix typo in Kconfig "Plance" -> "Plane" cpufreq: schedutil: restore cached freq when next_f is not changed acpi-cpufreq: Honor _PSD table setting on new AMD CPUs PM: AVS: smartreflex Move driver to soc specific drivers PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers PM: domains: enable domain idle state accounting PM: domains: Add curly braces to delimit comment + statement block PM: domains: Add support for PM domain on/off notifiers for genpd powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain powercap/intel_rapl: Fix domain detection intel_idle: Ignore _CST if control cannot be taken from the platform cpuidle: Remove pointless stub intel_idle: mention assumption that WBINVD is not needed MAINTAINERS: Add section for cpuidle-psci PM domain cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver ...
2020-10-23 | Merge tag 'net-5.10-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Cross-tree/merge window issues: - rtl8150: don't incorrectly assign random MAC addresses; fix late in the 5.9 cycle started depending on a return code from a function which changed with the 5.10 PR from the usb subsystem Current release regressions: - Revert "virtio-net: ethtool configurable RXCSUM", it was causing crashes at probe when control vq was not negotiated/available Previous release regressions: - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO bus, only first device would be probed correctly - nexthop: Fix performance regression in nexthop deletion by effectively switching from recently added synchronize_rcu() to synchronize_rcu_expedited() - netsec: ignore 'phy-mode' device property on ACPI systems; the property is not populated correctly by the firmware, but firmware configures the PHY so just keep boot settings Previous releases - always broken: - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing bulk transfers getting "stuck" - icmp: randomize the global rate limiter to prevent attackers from getting useful signal - r8169: fix operation under forced interrupt threading, make the driver always use hard irqs, even on RT, given the handler is light and only wants to schedule napi (and do so through a _irqoff() variant, preferably) - bpf: Enforce pointer id generation for all may-be-null register type to avoid pointers erroneously getting marked as null-checked - tipc: re-configure queue limit for broadcast link - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels - fix various issues in chelsio inline tls driver Misc: - bpf: improve just-added bpf_redirect_neigh() helper api to support supplying nexthop by the caller - in case BPF program has already done a lookup we can avoid doing another one - remove unnecessary break statements - make MCTCP not select IPV6, but rather depend on it" * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits) tcp: fix to update snd_wl1 in bulk receiver fast path net: Properly typecast int values to set sk_max_pacing_rate netfilter: nf_fwd_netdev: clear timestamp in forwarding path ibmvnic: save changed mac address to adapter->mac_addr selftests: mptcp: depends on built-in IPv6 Revert "virtio-net: ethtool configurable RXCSUM" rtnetlink: fix data overflow in rtnl_calcit() net: ethernet: mtk-star-emac: select REGMAP_MMIO net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh() bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop mptcp: depends on IPV6 but not as a module sfc: move initialisation of efx->filter_sem to efx_init_struct() mpls: load mpls_gso after mpls_iptunnel net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action() net: dsa: bcm_sf2: make const array static, makes object smaller mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it ...
2020-10-23 | Merge tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block | Linus Torvalds
Pull arch task_work cleanups from Jens Axboe: "Two cleanups that don't fit other categories: - Finally get the task_work_add() cleanup done properly, so we don't have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates all callers, and also fixes up the documentation for task_work_add(). - While working on some TIF related changes for 5.11, this TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch duplication for how that is handled" * tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block: task_work: cleanup notification modes tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
2020-10-22 | Merge tag 'kbuild-v5.10' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Support 'make compile_commands.json' to generate the compilation database more easily, avoiding stale entries - Support 'make clang-analyzer' and 'make clang-tidy' for static checks using clang-tidy - Preprocess scripts/modules.lds.S to allow CONFIG options in the module linker script - Drop cc-option tests from compiler flags supported by our minimal GCC/Clang versions - Use always 12-digits commit hash for CONFIG_LOCALVERSION_AUTO=y - Use sha1 build id for both BFD linker and LLD - Improve deb-pkg for reproducible builds and rootless builds - Remove stale, useless scripts/namespace.pl - Turn -Wreturn-type warning into error - Fix build error of deb-pkg when CONFIG_MODULES=n - Replace 'hostname' command with more portable 'uname -n' - Various Makefile cleanups * tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits) kbuild: Use uname for LINUX_COMPILE_HOST detection kbuild: Only add -fno-var-tracking-assignments for old GCC versions kbuild: remove leftover comment for filechk utility treewide: remove DISABLE_LTO kbuild: deb-pkg: clean up package name variables kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n kbuild: enforce -Werror=return-type scripts: remove namespace.pl builddeb: Add support for all required debian/rules targets builddeb: Enable rootless builds builddeb: Pass -n to gzip for reproducible packages kbuild: split the build log of kallsyms kbuild: explicitly specify the build id style scripts/setlocalversion: make git describe output more reliable kbuild: remove cc-option test of -Werror=date-time kbuild: remove cc-option test of -fno-stack-check kbuild: remove cc-option test of -fno-strict-overflow kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan kbuild: do not create built-in objects for external module builds ...
2020-10-22 | Merge tag 'modules-for-v5.10' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Code cleanups: more informative error messages and statically initialize init_free_wq to avoid a workqueue warning" * tag 'modules-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: statically initialize init section freeing data module: Add more error message for failed kernel module loading