summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-01-11genirq/msi: Add msi_device_has_isolated_msi()Jason Gunthorpe
This will replace irq_domain_check_msi_remap() in following patches. The new API makes it more clear what "msi_remap" actually means from a functional perspective instead of identifying an implementation specific HW feature. Isolated MSI means that HW modeled by an irq_domain on the path from the initiating device to the CPU will validate that the MSI message specifies an interrupt number that the device is authorized to trigger. This must block devices from triggering interrupts they are not authorized to trigger. Currently authorization means the MSI vector is one assigned to the device. This is interesting for securing VFIO use cases where a rouge MSI (eg created by abusing a normal PCI MemWr DMA) must not allow the VFIO userspace to impact outside its security domain, eg userspace triggering interrupts on kernel drivers, a VM triggering interrupts on the hypervisor, or a VM triggering interrupts on another VM. As this is actually modeled as a per-irq_domain property, not a global platform property, correct the interface to accept the device parameter and scan through only the part of the irq_domains hierarchy originating from the source device. Locate the new code in msi.c as it naturally only works with CONFIG_GENERIC_MSI_IRQ, which also requires CONFIG_IRQ_DOMAIN and IRQ_DOMAIN_HIERARCHY. Link: https://lore.kernel.org/r/1-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-11genirq: Add might_sleep() to disable_irq()Manfred Spraul
With the introduction of threaded interrupt handlers, it is virtually never safe to call disable_irq() from non-premptible context. Thus: Update the documentation, add an explicit might_sleep() to catch any offenders. This is more obvious and straight forward than the implicit might_sleep() check deeper down in the disable_irq() call chain. Fixes: 3aa551c9b4c4 ("genirq: add threaded interrupt handler support") Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221216150441.200533-3-manfred@colorfullife.com
2023-01-11timers: Prevent union confusion from unexpected restart_syscall()Jann Horn
The nanosleep syscalls use the restart_block mechanism, with a quirk: The `type` and `rmtp`/`compat_rmtp` fields are set up unconditionally on syscall entry, while the rest of the restart_block is only set up in the unlikely case that the syscall is actually interrupted by a signal (or pseudo-signal) that doesn't have a signal handler. If the restart_block was set up by a previous syscall (futex(..., FUTEX_WAIT, ...) or poll()) and hasn't been invalidated somehow since then, this will clobber some of the union fields used by futex_wait_restart() and do_restart_poll(). If userspace afterwards wrongly calls the restart_syscall syscall, futex_wait_restart()/do_restart_poll() will read struct fields that have been clobbered. This doesn't actually lead to anything particularly interesting because none of the union fields contain trusted kernel data, and futex(..., FUTEX_WAIT, ...) and poll() aren't syscalls where it makes much sense to apply seccomp filters to their arguments. So the current consequences are just of the "if userspace does bad stuff, it can damage itself, and that's not a problem" flavor. But still, it seems like a hazard for future developers, so invalidate the restart_block when partly setting it up in the nanosleep syscalls. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230105134403.754986-1-jannh@google.com
2023-01-11printk: adjust string limit macrosJohn Ogness
The various internal size limit macros have names and/or values that do not fit well to their current usage. Rename the macros so that their purpose is clear and, if needed, provide a more appropriate value. In general, the new macros and values will lead to less memory usage. The new macros are... PRINTK_MESSAGE_MAX: This is the maximum size for a formatted message on a console, devkmsg, or syslog. It does not matter which format the message has (normal or extended). It replaces the use of CONSOLE_EXT_LOG_MAX for console and devkmsg. It replaces the use of CONSOLE_LOG_MAX for syslog. Historically, normal messages have been allowed to print up to 1kB, whereas extended messages have been allowed to print up to 8kB. However, the difference in lengths of these message types is not significant and in multi-line records, normal messages are probably larger. Also, because 1kB is only slightly above the allowed record size, multi-line normal messages could be easily truncated during formatting. This new macro should be significantly larger than the allowed record size to allow sufficient space for extended or multi-line prefix text. A value of 2kB should be plenty of space. For normal messages this represents a doubling of the historically allowed amount. For extended messages it reduces the excessive 8kB size, thus reducing memory usage needed for message formatting. PRINTK_PREFIX_MAX: This is the maximum size allowed for a record prefix (used by console and syslog). It replaces PREFIX_MAX. The value is left unchanged. PRINTKRB_RECORD_MAX: This is the maximum size allowed to be reserved for a record in the ringbuffer. It is used by all readers and writers with the printk ringbuffer. It replaces LOG_LINE_MAX. Previously this was set to "1kB - PREFIX_MAX", which makes some sense if 1kB is the limit for normal message output and prefixes are enabled. However, with the allowance of larger output and the existence of multi-line records, the value is rather bizarre. Round the value up to 1kB. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230109100800.1085541-9-john.ogness@linutronix.de
2023-01-11printk: use printk_buffers for devkmsgJohn Ogness
Replace the buffers in struct devkmsg_user with a struct printk_buffers. This reduces the number of buffers to keep track of. As a side-effect, @text_buf was 8kB large, even though it only needed to be the max size of a ringbuffer record. By switching to struct printk_buffers, ~7kB less memory is allocated when opening /dev/kmsg. And since struct printk_buffers will be used now, reduce duplicate code by calling printk_get_next_message() to handle the record reading and formatting. Note that since /dev/kmsg never suppresses records based on loglevel, printk_get_next_message() is extended with an extra bool argument to specify if suppression is allowed. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230109100800.1085541-8-john.ogness@linutronix.de
2023-01-11printk: introduce console_prepend_dropped() for dropped messagesJohn Ogness
Currently "dropped messages" are separately printed immediately before printing the printk message. Since normal consoles are now using an output buffer that is much larger than previously, the "dropped message" could be prepended to the printk message and then output everything in a single write() call. Introduce a helper function console_prepend_dropped() to prepend an existing message with a "dropped message". This simplifies the code by allowing all message formatting to be handled together and then only requires a single write() call to output the full message. And since this helper does not require any locking, it can be used in the future for other console printing contexts as well. Note that console_prepend_dropped() is defined as a NOP for !CONFIG_PRINTK. Although the function will never be called for !CONFIG_PRINTK, compiling the function can lead to warnings of "always true" conditionals due to the size macro values used in !CONFIG_PRINTK. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230109100800.1085541-7-john.ogness@linutronix.de
2023-01-11printk: introduce printk_get_next_message() and printk_messageJohn Ogness
Code for performing the console output is intermixed with code that is formatting the output for that console. Introduce a new helper function printk_get_next_message() to handle the reading and formatting of the printk text. The helper does not require any locking so that in the future it can be used for other printing contexts as well. This also introduces a new struct printk_message to wrap the struct printk_buffers, adding metadata about its contents. This allows users of printk_get_next_message() to receive all relevant information about the message that was read and formatted. Why is struct printk_message a wrapper struct? It is intentional that a wrapper struct is introduced instead of adding the metadata directly to struct printk_buffers. The upcoming atomic consoles support multiple printing contexts per CPU. This means that while a CPU is formatting a message, it can be interrupted and the interrupting context may also format a (possibly different) message. Since the printk buffers are rather large, there will only be one struct printk_buffers per CPU and it must be shared by the possible contexts of that CPU. If the metadata was part of struct printk_buffers, interrupting contexts would clobber the metadata being prepared by the interrupted context. This could be handled by robustifying the message formatting functions to cope with metadata unexpectedly changing. However, this would require significant amounts of extra data copying, also adding significant complexity to the code. Instead, the metadata can live on the stack of the formatting context and the message formatting functions do not need to be concerned about the metadata changing underneath them. Note that the message formatting functions can handle unexpected text buffer changes. So it is perfectly OK if a shared text buffer is clobbered by an interrupting context. The atomic console implementation will recognize the interruption and avoid printing the (probably garbage) text buffer. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230109100800.1085541-6-john.ogness@linutronix.de
2023-01-11printk: introduce struct printk_buffersJohn Ogness
Introduce a new struct printk_buffers to contain all the buffers needed to read and format a printk message for output. Putting the buffers inside a struct reduces the number of buffer pointers that need to be tracked. Also, it allows usage of the sizeof() macro for the buffer sizes, rather than expecting certain sized buffers being passed in. Note that since the output buffer for normal consoles is now CONSOLE_EXT_LOG_MAX instead of CONSOLE_LOG_MAX, multi-line messages that may have been previously truncated will now be printed in full. This should be considered a feature and not a bug since the CONSOLE_LOG_MAX restriction was about limiting static buffer usage rather than limiting printed text. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230109100800.1085541-5-john.ogness@linutronix.de
2023-01-11printk: move size limit macros into internal.hJohn Ogness
The size limit macros are located further down in printk.c and behind ifdef conditionals. This complicates their usage for upcoming changes. Move the macros into internal.h so that they are still invisible outside of printk, but easily accessible for printk. Also, the maximum size of formatted extended messages does not need to be known by any code outside of printk, so move it to internal.h as well. And like CONSOLE_LOG_MAX, for !CONFIG_PRINTK set CONSOLE_EXT_LOG_MAX to 0 to reduce the static memory footprint. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230109100800.1085541-2-john.ogness@linutronix.de
2023-01-10bpf: btf: limit logging of ignored BTF mismatchesConnor O'Brien
Enabling CONFIG_MODULE_ALLOW_BTF_MISMATCH is an indication that BTF mismatches are expected and module loading should proceed anyway. Logging with pr_warn() on every one of these "benign" mismatches creates unnecessary noise when many such modules are loaded. Instead, handle this case with a single log warning that BTF info may be unavailable. Mismatches also result in calls to __btf_verifier_log() via __btf_verifier_log_type() or btf_verifier_log_member(), adding several additional lines of logging per mismatched module. Add checks to these paths to skip logging for module BTF mismatches in the "allow mismatch" case. All existing logging behavior is preserved in the default CONFIG_MODULE_ALLOW_BTF_MISMATCH=n case. Signed-off-by: Connor O'Brien <connoro@google.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230107025331.3240536-1-connoro@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-10cgroup/cpuset: fix a few kernel-doc warnings & coding styleRandy Dunlap
Fix kernel-doc notation warnings: kernel/cgroup/cpuset.c:1309: warning: Excess function parameter 'cpuset' description in 'update_parent_subparts_cpumask' kernel/cgroup/cpuset.c:3909: warning: expecting prototype for cpuset_mem_spread_node(). Prototype was for cpuset_spread_node() instead Also drop a blank line before EXPORT_SYMBOL_GPL() to be consistent with kernel coding style. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-10Merge tag 'xtensa-20230110' of https://github.com/jcmvbkbc/linux-xtensaLinus Torvalds
Pull xtensa fixes from Max Filippov: - fix xtensa allmodconfig build broken by the kcsan test - drop unused members of struct thread_struct * tag 'xtensa-20230110' of https://github.com/jcmvbkbc/linux-xtensa: xtensa: drop unused members of struct thread_struct kcsan: test: don't put the expect array on the stack
2023-01-09bpf: remove the do_idr_lock parameter from bpf_prog_free_id()Paul Moore
It was determined that the do_idr_lock parameter to bpf_prog_free_id() was not necessary as it should always be true. Suggested-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230106154400.74211-2-paul@paul-moore.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-09bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and ↵Paul Moore
PERF_BPF_EVENT_PROG_UNLOAD When changing the ebpf program put() routines to support being called from within IRQ context the program ID was reset to zero prior to calling the perf event and audit UNLOAD record generators, which resulted in problems as the ebpf program ID was bogus (always zero). This patch addresses this problem by removing an unnecessary call to bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting __bpf_prog_put() to only call bpf_prog_free_id() after audit and perf have finished their bpf program unload tasks in bpf_prog_put_deferred(). For the record, no one can determine, or remember, why it was necessary to free the program ID, and remove it from the IDR, prior to executing bpf_prog_put_deferred(); regardless, both Stanislav and Alexei agree that the approach in this patch should be safe. It is worth noting that when moving the bpf_prog_free_id() call, the do_idr_lock parameter was forced to true as the ebpf devs determined this was the correct as the do_idr_lock should always be true. The do_idr_lock parameter will be removed in a follow-up patch, but it was kept here to keep the patch small in an effort to ease any stable backports. I also modified the bpf_audit_prog() logic used to associate the AUDIT_BPF record with other associated records, e.g. @ctx != NULL. Instead of keying off the operation, it now keys off the execution context, e.g. '!in_irg && !irqs_disabled()', which is much more appropriate and should help better connect the UNLOAD operations with the associated audit state (other audit records). Cc: stable@vger.kernel.org Fixes: d809e134be7a ("bpf: Prepare bpf_prog_put() to be called from irq context.") Reported-by: Burn Alting <burn.alting@iinet.net.au> Reported-by: Jiri Olsa <olsajiri@gmail.com> Suggested-by: Stanislav Fomichev <sdf@google.com> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230106154400.74211-1-paul@paul-moore.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-09rcu: Allow up to five minutes expedited RCU CPU stall-warning timeoutsPaul E. McKenney
The maximum value of RCU CPU stall-warning timeouts has historically been five minutes (300 seconds). However, the recently introduced expedited RCU CPU stall-warning timeout is instead limited to 21 seconds. This causes problems for CI/fuzzing services such as syzkaller by obscuring the issue in question with expedited RCU CPU stall-warning timeout splats. This commit therefore sets the RCU_EXP_CPU_STALL_TIMEOUT Kconfig options upper bound to 300000 milliseconds, which is 300 seconds (AKA 5 minutes). [ paulmck: Apply feedback from Hillf Danton. ] [ paulmck: Apply feedback from Geert Uytterhoeven. ] Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-09sched/core: Use kfree_rcu() in do_set_cpus_allowed()Waiman Long
Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") may call kfree() if user_cpus_ptr was previously set. Unfortunately, some of the callers of do_set_cpus_allowed() may have pi_lock held when calling it. So the following splats may be printed especially when running with a PREEMPT_RT kernel: WARNING: possible circular locking dependency detected BUG: sleeping function called from invalid context To avoid these problems, kfree_rcu() is used instead. An internal cpumask_rcuhead union is created for the sole purpose of facilitating the use of kfree_rcu() to free the cpumask. Since user_cpus_ptr is not being used in non-SMP configs, the newly introduced alloc_user_cpus_ptr() helper will return NULL in this case and sched_setaffinity() is modified to handle this special case. Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20221231041120.440785-3-longman@redhat.com
2023-01-09sched/core: Fix use-after-free bug in dup_user_cpus_ptr()Waiman Long
Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems"), the setting and clearing of user_cpus_ptr are done under pi_lock for arm64 architecture. However, dup_user_cpus_ptr() accesses user_cpus_ptr without any lock protection. Since sched_setaffinity() can be invoked from another process, the process being modified may be undergoing fork() at the same time. When racing with the clearing of user_cpus_ptr in __set_cpus_allowed_ptr_locked(), it can lead to user-after-free and possibly double-free in arm64 kernel. Commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask") fixes this problem as user_cpus_ptr, once set, will never be cleared in a task's lifetime. However, this bug was re-introduced in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in do_set_cpus_allowed(). This time, it will affect all arches. Fix this bug by always clearing the user_cpus_ptr of the newly cloned/forked task before the copying process starts and check the user_cpus_ptr state of the source task under pi_lock. Note to stable, this patch won't be applicable to stable releases. Just copy the new dup_user_cpus_ptr() function over. Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems") Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") Reported-by: David Wang 王标 <wangbiao3@xiaomi.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
2023-01-07sched/core: Fix arch_scale_freq_tick() on tickless systemsYair Podemsky
In order for the scheduler to be frequency invariant we measure the ratio between the maximum CPU frequency and the actual CPU frequency. During long tickless periods of time the calculations that keep track of that might overflow, in the function scale_freq_tick(): if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt)) goto error; eventually forcing the kernel to disable the feature for all CPUs, and show the warning message: "Scheduler frequency invariance went wobbly, disabling!". Let's avoid that by limiting the frequency invariant calculations to CPUs with regular tick. Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in frequency invariant accounting") Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Yair Podemsky <ypodemsk@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz> Link: https://lore.kernel.org/r/20221130125121.34407-1-ypodemsk@redhat.com
2023-01-07sched/membarrier: Introduce MEMBARRIER_CMD_GET_REGISTRATIONSMichal Clapinski
Provide a method to query previously issued registrations. Signed-off-by: Michal Clapinski <mclapinski@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20221207164338.1535591-2-mclapinski@google.com
2023-01-07cpufreq, sched/util: Optimize operations with single CPU capacity lookupLukasz Luba
The max CPU capacity is the same for all CPUs sharing frequency domain. There is a way to avoid heavy operations in a loop for each CPU by leveraging this knowledge. Thus, simplify the looping code in the sugov_next_freq_shared() and drop heavy multiplications. Instead, use simple max() to get the highest utilization from these CPUs. This is useful for platforms with many (4 or 6) little CPUs. We avoid heavy 2*PD_CPU_NUM multiplications in that loop, which is called billions of times, since it's not limited by the schedutil time delta filter in sugov_should_update_freq(). When there was no need to change frequency the code bailed out, not updating the sg_policy::last_freq_update_time. Then every visit after delta_ns time longer than the sg_policy::freq_update_delay_ns goes through and triggers the next frequency calculation code. Although, if the next frequency, as outcome of that, would be the same as current frequency, we won't update the sg_policy::last_freq_update_time and the story will be repeated (in a very short period, sometimes a few microseconds). The max CPU capacity must be fetched every time we are called, due to difficulties during the policy setup, where we are not able to get the normalized CPU capacity at the right time. The fetched CPU capacity value is than used in sugov_iowait_apply() to calculate the right boost. This required a few changes in the local functions and arguments. The capacity value should hopefully be fetched once when needed and then passed over CPU registers to those functions. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221208160256.859-2-lukasz.luba@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org>
2023-01-07sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()Chengming Zhou
ttwu_do_activate() is used for a complete wakeup, in which we will activate_task() and use ttwu_do_wakeup() to mark the task runnable and perform wakeup-preemption, also call class->task_woken() callback and update the rq->idle_stamp. Since ttwu_runnable() is not a complete wakeup, don't need all those done in ttwu_do_wakeup(), so we can move those to ttwu_do_activate() to simplify ttwu_do_wakeup(), making it only mark the task runnable to be reused in ttwu_runnable() and try_to_wake_up(). This patch should not have any functional changes. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com
2023-01-07sched/core: Micro-optimize ttwu_runnable()Chengming Zhou
ttwu_runnable() is used as a fast wakeup path when the wakee task is running on CPU or runnable on RQ, in both cases we can just set its state to TASK_RUNNING to prevent a sleep. If the wakee task is on_cpu running, we don't need to update_rq_clock() or check_preempt_curr(). But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before the task got to schedule() and the task been preempted), we should check_preempt_curr() to see if it can preempt the current running. This also removes the class->task_woken() callback from ttwu_runnable(), which wasn't required per the RT/DL implementations: any required push operation would have been queued during class->set_next_task() when p got preempted. ttwu_runnable() also loses the update to rq->idle_stamp, as by definition the rq cannot be idle in this scenario. Suggested-by: Valentin Schneider <vschneid@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com
2023-01-06workqueue: Make show_pwq() use run-length encodingPaul E. McKenney
The show_pwq() function dumps out a pool_workqueue structure's activity, including the pending work-queue handlers: Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11 in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1 When large systems are facing certain types of hang conditions, it is not unusual for this "pending" list to contain runs of hundreds of identical function names. This "wall of text" is difficult to read, and worse yet, it can be interleaved with other output such as stack traces. Therefore, make show_pwq() use run-length encoding so that the above printout instead looks like this: Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11 in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func pending: 2*test_work_func, 5*test_work_func1 When no comma would be printed, including the WORK_STRUCT_LINKED case, a new run is started unconditionally. This output is more readable, places less stress on the hardware, firmware, and software on the console-log path, and reduces interference with other output. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-06bpf: Skip task with pid=1 in send_signal_common()Hao Sun
The following kernel panic can be triggered when a task with pid=1 attaches a prog that attempts to send killing signal to itself, also see [1] for more details: Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b CPU: 3 PID: 1 Comm: systemd Not tainted 6.1.0-09652-g59fe41b5255f #148 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x100/0x178 lib/dump_stack.c:106 panic+0x2c4/0x60f kernel/panic.c:275 do_exit.cold+0x63/0xe4 kernel/exit.c:789 do_group_exit+0xd4/0x2a0 kernel/exit.c:950 get_signal+0x2460/0x2600 kernel/signal.c:2858 arch_do_signal_or_restart+0x78/0x5d0 arch/x86/kernel/signal.c:306 exit_to_user_mode_loop kernel/entry/common.c:168 [inline] exit_to_user_mode_prepare+0x15f/0x250 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:296 do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd So skip task with pid=1 in bpf_send_signal_common() to avoid the panic. [1] https://lore.kernel.org/bpf/20221222043507.33037-1-sunhao.th@gmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230106084838.12690-1-sunhao.th@gmail.com
2023-01-06bpf: Skip invalid kfunc call in backtrack_insnHao Sun
The verifier skips invalid kfunc call in check_kfunc_call(), which would be captured in fixup_kfunc_call() if such insn is not eliminated by dead code elimination. However, this can lead to the following warning in backtrack_insn(), also see [1]: ------------[ cut here ]------------ verifier backtracking bug WARNING: CPU: 6 PID: 8646 at kernel/bpf/verifier.c:2756 backtrack_insn kernel/bpf/verifier.c:2756 __mark_chain_precision kernel/bpf/verifier.c:3065 mark_chain_precision kernel/bpf/verifier.c:3165 adjust_reg_min_max_vals kernel/bpf/verifier.c:10715 check_alu_op kernel/bpf/verifier.c:10928 do_check kernel/bpf/verifier.c:13821 [inline] do_check_common kernel/bpf/verifier.c:16289 [...] So make backtracking conservative with this by returning ENOTSUPP. [1] https://lore.kernel.org/bpf/CACkBjsaXNceR8ZjkLG=dT3P=4A8SBsg0Z5h5PWLryF5=ghKq=g@mail.gmail.com/ Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230104014709.9375-1-sunhao.th@gmail.com
2023-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05Merge tag 'net-6.2-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, wifi, and netfilter. Current release - regressions: - bpf: fix nullness propagation for reg to reg comparisons, avoid null-deref - inet: control sockets should not use current thread task_frag - bpf: always use maximal size for copy_array() - eth: bnxt_en: don't link netdev to a devlink port for VFs Current release - new code bugs: - rxrpc: fix a couple of potential use-after-frees - netfilter: conntrack: fix IPv6 exthdr error check - wifi: iwlwifi: fw: skip PPAG for JF, avoid FW crashes - eth: dsa: qca8k: various fixes for the in-band register access - eth: nfp: fix schedule in atomic context when sync mc address - eth: renesas: rswitch: fix getting mac address from device tree - mobile: ipa: use proper endpoint mask for suspend Previous releases - regressions: - tcp: add TIME_WAIT sockets in bhash2, fix regression caught by Jiri / python tests - net: tc: don't intepret cls results when asked to drop, fix oob-access - vrf: determine the dst using the original ifindex for multicast - eth: bnxt_en: - fix XDP RX path if BPF adjusted packet length - fix HDS (header placement) and jumbo thresholds for RX packets - eth: ice: xsk: do not use xdp_return_frame() on tx_buf->raw_buf, avoid memory corruptions Previous releases - always broken: - ulp: prevent ULP without clone op from entering the LISTEN status - veth: fix race with AF_XDP exposing old or uninitialized descriptors - bpf: - pull before calling skb_postpull_rcsum() (fix checksum support and avoid a WARN()) - fix panic due to wrong pageattr of im->image (when livepatch and kretfunc coexist) - keep a reference to the mm, in case the task is dead - mptcp: fix deadlock in fastopen error path - netfilter: - nf_tables: perform type checking for existing sets - nf_tables: honor set timeout and garbage collection updates - ipset: fix hash:net,port,net hang with /0 subnet - ipset: avoid hung task warning when adding/deleting entries - selftests: net: - fix cmsg_so_mark.sh test hang on non-x86 systems - fix the arp_ndisc_evict_nocarrier test for IPv6 - usb: rndis_host: secure rndis_query check against int overflow - eth: r8169: fix dmar pte write access during suspend/resume with WOL - eth: lan966x: fix configuration of the PCS - eth: sparx5: fix reading of the MAC address - eth: qed: allow sleep in qed_mcp_trace_dump() - eth: hns3: - fix interrupts re-initialization after VF FLR - fix handling of promisc when MAC addr table gets full - refine the handling for VF heartbeat - eth: mlx5: - properly handle ingress QinQ-tagged packets on VST - fix io_eq_size and event_eq_size params validation on big endian - fix RoCE setting at HCA level if not supported at all - don't turn CQE compression on by default for IPoIB - eth: ena: - fix toeplitz initial hash key value - account for the number of XDP-processed bytes in interface stats - fix rx_copybreak value update Misc: - ethtool: harden phy stat handling against buggy drivers - docs: netdev: convert maintainer's doc from FAQ to a normal document" * tag 'net-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits) caif: fix memory leak in cfctrl_linkup_request() inet: control sockets should not use current thread task_frag net/ulp: prevent ULP without clone op from entering the LISTEN status qed: allow sleep in qed_mcp_trace_dump() MAINTAINERS: Update maintainers for ptp_vmw driver usb: rndis_host: Secure rndis_query check against int overflow net: dpaa: Fix dtsec check for PCS availability octeontx2-pf: Fix lmtst ID used in aura free drivers/net/bonding/bond_3ad: return when there's no aggregator netfilter: ipset: Rework long task execution when adding/deleting entries netfilter: ipset: fix hash:net,port,net hang with /0 subnet net: sparx5: Fix reading of the MAC address vxlan: Fix memory leaks in error path net: sched: htb: fix htb_classify() kernel-doc net: sched: cbq: dont intepret cls results when asked to drop net: sched: atm: dont intepret cls results when asked to drop dt-bindings: net: marvell,orion-mdio: Fix examples dt-bindings: net: sun8i-emac: Add phy-supply property net: ipa: use proper endpoint mask for suspend selftests: net: return non-zero for failures reported in arp_ndisc_evict_nocarrier ...
2023-01-05clocksource: Improve "skew is too large" messagesPaul E. McKenney
When clocksource_watchdog() detects excessive clocksource skew compared to the watchdog clocksource, it marks the clocksource under test as unstable and prints several lines worth of message. But that message is unclear to anyone unfamiliar with the code: clocksource: timekeeping watchdog on CPU2: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large: clocksource: 'kvm-clock' wd_nsec: 400744390 wd_now: 612625c2c wd_last: 5fa7f7c66 mask: ffffffffffffffff clocksource: 'wdtest-ktime' cs_nsec: 600744034 cs_now: 173081397a292d4f cs_last: 17308139565a8ced mask: ffffffffffffffff clocksource: 'kvm-clock' (not 'wdtest-ktime') is current clocksource. Therefore, add the following line near the end of that message: Clocksource 'wdtest-ktime' skewed 199999644 ns (199 ms) over watchdog 'kvm-clock' interval of 400744390 ns (400 ms) This new line clearly indicates the amount of skew between the two clocksources, along with the duration of the time interval over which the skew occurred, both in nanoseconds and milliseconds. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Feng Tang <feng.tang@intel.com>
2023-01-05rcu: Align the output of RCU CPU stall warning messagesZhen Lei
Time stamps are added to the output in kernels built with CONFIG_PRINTK_TIME=y, which causes misaligned output. Therefore, replace pr_cont() with pr_err(), which fixes alignment and gets rid of a couple of despised pr_cont() calls. Before: [ 37.567343] rcu: INFO: rcu_preempt self-detected stall on CPU [ 37.567839] rcu: 0-....: (1500 ticks this GP) idle=*** [ 37.568270] (t=1501 jiffies g=4717 q=28 ncpus=4) [ 37.568668] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #8 After: [ 36.762074] rcu: INFO: rcu_preempt self-detected stall on CPU [ 36.762543] rcu: 0-....: (1499 ticks this GP) idle=*** [ 36.763003] rcu: (t=1500 jiffies g=5097 q=27 ncpus=4) [ 36.763522] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #9 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05rcu: Add RCU stall diagnosis informationZhen Lei
Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) <omitted> rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05sched: Add helper nr_context_switches_cpu()Zhen Lei
Add a function nr_context_switches_cpu() that returns number of context switches since boot on the specified CPU. This information will be used to diagnose RCU CPU stalls. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05torture: Fix hang during kthread shutdown phaseJoel Fernandes (Google)
During rcutorture shutdown, the rcu_torture_cleanup() function calls torture_cleanup_begin(), which sets the fullstop global variable to FULLSTOP_RMMOD. This causes the rcutorture threads for readers and fakewriters to exit all of their "while" loops and start shutting down. They then call torture_kthread_stopping(), which in turn waits for kthread_stop() to be called. However, rcu_torture_cleanup() has not yet called kthread_stop() on those threads, and before it gets a chance to do so, multiple instances of torture_kthread_stopping() invoke schedule_timeout_interruptible(1) in a tight loop. Tracing confirms that TIMER_SOFTIRQ can then continuously execute timer callbacks. If that TIMER_SOFTIRQ preempts the task executing rcu_torture_cleanup(), that task might never invoke kthread_stop(). This commit improves this situation by increasing the timeout passed to schedule_timeout_interruptible() from one jiffy to 1/20th of a second. This change prevents TIMER_SOFTIRQ from monopolizing its CPU, thus allowing rcu_torture_cleanup() to carry out the needed kthread_stop() invocations. Testing has shown 100 runs of TREE07 passing reliably, as oppose to the tens-of-percent failure rates seen beforehand. Cc: Paul McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> Cc: <stable@vger.kernel.org> # 6.0.x Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05rcutorture: Drop sparse lock-acquisition annotationsPaul E. McKenney
The sparse __acquires() and __releases() annotations provide very little value. The argument is ignored, so sparse cannot tell the differences between acquiring one lock and releasing another on the one hand and acquiring and releasing a given lock on the other. In addition, lockdep annotations provide much more precision, for but one example, actually knowing which lock is held. This commit therefore removes the __acquires() and __releases() annotations from rcutorture. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05locktorture: Make the rt_boost factor a tunableJoel Fernandes (Google)
The rt boosting in locktorture has a factor variable s currently large enough that boosting only happens once every minute or so. Add a tunable to reduce the factor so that boosting happens more often, to test paths and arrive at failure modes earlier. With this change, I can set the factor to like 50 and have the boosting happens every 10 seconds or so. Tested with boot parameters: locktorture.torture_type=mutex_lock locktorture.onoff_interval=1 locktorture.nwriters_stress=8 locktorture.stutter=0 locktorture.rt_boost=1 locktorture.rt_boost_factor=50 locktorture.nlocks=3 Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05locktorture: Allow non-rtmutex lock types to be boostedJoel Fernandes (Google)
Currently RT boosting is only done for rtmutex_lock, however with proxy execution, we also have the mutex_lock participating in priorities. To exercise the testing better, add RT boosting to other lock testing types as well, using a new knob (rt_boost). Tested with boot parameters: locktorture.torture_type=mutex_lock locktorture.onoff_interval=1 locktorture.nwriters_stress=8 locktorture.stutter=0 locktorture.rt_boost=1 locktorture.rt_boost_factor=1 locktorture.nlocks=3 Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05refscale: Add tests using SLAB_TYPESAFE_BY_RCUPaul E. McKenney
This commit adds three read-side-only tests of three use cases featuring SLAB_TYPESAFE_BY_RCU: One using per-object reference counting, one using per-object locking, and one using per-object sequence locking. [ paulmck: Apply feedback from kernel test robot. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-05sched/topology: Add __init for sched_init_domains()Bing Huang
sched_init_domains() is only used in initialization Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230105014943.9857-1-huangbing775@126.com
2023-01-05locking/qspinlock: Micro-optimize pending state waiting for unlockGuo Ren
When we're pending, we only care about lock value. The xchg_tail wouldn't affect the pending state. That means the hardware thread could stay in a sleep state and leaves the rest execution units' resources of pipeline to other hardware threads. This situation is the SMT scenarios in the same core. Not an entering low-power state situation. Of course, the granularity between cores is "cacheline", but the granularity between SMT hw threads of the same core could be "byte" which internal LSU handles. For example, when a hw-thread yields the resources of the core to other hw-threads, this patch could help the hw-thread stay in the sleep state and prevent it from being woken up by other hw-threads xchg_tail. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org Cc: Peter Zijlstra <peterz@infradead.org>
2023-01-04Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-04 We've added 45 non-merge commits during the last 21 day(s) which contain a total of 50 files changed, 1454 insertions(+), 375 deletions(-). The main changes are: 1) Fixes, improvements and refactoring of parts of BPF verifier's state equivalence checks, from Andrii Nakryiko. 2) Fix a few corner cases in libbpf's BTF-to-C converter in particular around padding handling and enums, also from Andrii Nakryiko. 3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata, from Christian Ehrig. 4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks, from Dave Marchevsky. 5) Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers, from Jiri Olsa. 6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps, from Maryam Tahhan. 7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du. 8) Bigger batch of improvements to BPF tracing code samples, from Daniel T. Lee. 9) Add LoongArch support to libbpf's bpf_tracing helper header, from Hengqi Chen. 10) Fix a libbpf compiler warning in perf_event_open_probe on arm32, from Khem Raj. 11) Optimize bpf_local_storage_elem by removing 56 bytes of padding, from Martin KaFai Lau. 12) Use pkg-config to locate libelf for resolve_btfids build, from Shen Jiamin. 13) Various libbpf improvements around API documentation and errno handling, from Xin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) libbpf: Return -ENODATA for missing btf section libbpf: Add LoongArch support to bpf_tracing.h libbpf: Restore errno after pr_warn. libbpf: Added the description of some API functions libbpf: Fix invalid return address register in s390 samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro samples/bpf: Change _kern suffix to .bpf with syscall tracing program samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program samples/bpf: Use kyscall instead of kprobe in syscall tracing program bpf: rename list_head -> graph_root in field info types libbpf: fix errno is overwritten after being closed. bpf: fix regs_exact() logic in regsafe() to remap IDs correctly bpf: perform byte-by-byte comparison only when necessary in regsafe() bpf: reject non-exact register type matches in regsafe() bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule bpf: reorganize struct bpf_reg_state fields bpf: teach refsafe() to take into account ID remapping bpf: Remove unused field initialization in bpf's ctl_table selftests/bpf: Add jit probe_mem corner case tests to s390x denylist ... ==================== Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-04workqueue: Add a new flag to spot the potential UAF errorRichard Clark
Currently if the user queues a new work item unintentionally into a wq after the destroy_workqueue(wq), the work still can be queued and scheduled without any noticeable kernel message before the end of a RCU grace period. As a debug-aid facility, this commit adds a new flag __WQ_DESTROYING to spot that issue by triggering a kernel WARN message. Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-04cgroup/cpuset: no need to explicitly init a global static variableDaniel Vacek
cpuset_rwsem is a static variable defined with DEFINE_STATIC_PERCPU_RWSEM(). It's initialized at build time and so there's no need for explicit runtime init leaking one percpu int. Signed-off-by: Daniel Vacek <neelx@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-03clocksource: Improve read-back-delay messagePaul E. McKenney
When cs_watchdog_read() is unable to get a qualifying clocksource read within the limit set by max_cswd_read_retries, it prints a message and marks the clocksource under test as unstable. But that message is unclear to anyone unfamiliar with the code: clocksource: timekeeping watchdog on CPU13: wd-tsc-wd read-back delay 1000614ns, attempt 3, marking unstable Therefore, add some context so that the message appears as follows: clocksource: timekeeping watchdog on CPU13: wd-tsc-wd excessive read-back delay of 1000614ns vs. limit of 125000ns, wd-wd read-back delay only 27ns, attempt 3, marking tsc unstable Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Feng Tang <feng.tang@intel.com>
2023-01-03clocksource: Loosen clocksource watchdog constraintsPaul E. McKenney
Currently, MAX_SKEW_USEC is set to 100 microseconds, which has worked reasonably well. However, NTP is willing to tolerate 500 microseconds of skew per second, and a clocksource that is good enough for NTP should be good enough for the clocksource watchdog. The watchdog's skew is controlled by MAX_SKEW_USEC and the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option. However, these values are doubled before being associated with a clocksource's ->uncertainty_margin, and the ->uncertainty_margin values of the pair of clocksource's being compared are summed before checking against the skew. Therefore, set both MAX_SKEW_USEC and the default for the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option to 125 microseconds of skew per second, resulting in 500 microseconds of skew per second in the clocksource watchdog's skew comparison. Suggested-by Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03clocksource: Print clocksource name when clocksource is tested unstableYunying Sun
Some "TSC fall back to HPET" messages appear on systems having more than 2 NUMA nodes: clocksource: timekeeping watchdog on CPU168: hpet read-back delay of 4296200ns, attempt 4, marking unstable The "hpet" here is misleading the clocksource watchdog is really doing repeated reads of "hpet" in order to check for unrelated delays. Therefore, print the name of the clocksource under test, prefixed by "wd-" and suffixed by "-wd", for example, "wd-tsc-wd". Signed-off-by: Yunying Sun <yunying.sun@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03refscale: Provide for initialization failurePaul E. McKenney
Current tests all have init() functions that are guaranteed to succeed. But upcoming tests will need to allocate memory, thus possibly failing. This commit therefore handles init() function failure. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03torture: Seed torture_random_state on CPUPaul E. McKenney
The DEFINE_TORTURE_RANDOM_PERCPU() macro defines per-CPU random-number generators for torture testing, but the seeds for each CPU's instance will be identical if they are first used at the same time. This commit therefore adds the CPU number to the mix when reseeding. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu-tasks: Handle queue-shrink/callback-enqueue race conditionZqiang
The rcu_tasks_need_gpcb() determines whether or not: (1) There are callbacks needing another grace period, (2) There are callbacks ready to be invoked, and (3) It would be a good time to shrink back down to a single-CPU callback list. This third case is interesting because some other CPU might be adding new callbacks, which might suddenly make this a very bad time to be shrinking. This is currently handled by requiring call_rcu_tasks_generic() to enqueue callbacks under the protection of rcu_read_lock() and requiring rcu_tasks_need_gpcb() to wait for an RCU grace period to elapse before finalizing the transition. This works well in practice. Unfortunately, the current code assumes that a grace period whose end is detected by the poll_state_synchronize_rcu() in the second "if" condition actually ended before the earlier code counted the callbacks queued on CPUs other than CPU 0 (local variable "ncbsnz"). Given the current code, it is possible that a long-delayed call_rcu_tasks_generic() invocation will queue a callback on a non-zero CPU after these CPUs have had their callbacks counted and zero has been stored to ncbsnz. Such a callback would trigger the WARN_ON_ONCE() in the second "if" statement. To see this, consider the following sequence of events: o CPU 0 invokes rcu_tasks_one_gp(), and counts fewer than rcu_task_collapse_lim callbacks. It sees at least one callback queued on some other CPU, thus setting ncbsnz to a non-zero value. o CPU 1 invokes call_rcu_tasks_generic() and loads 42 from ->percpu_enqueue_lim. It therefore decides to enqueue its callback onto CPU 1's callback list, but is delayed. o CPU 0 sees the rcu_task_cb_adjust is non-zero and that the number of callbacks does not exceed rcu_task_collapse_lim. It therefore checks percpu_enqueue_lim, and sees that its value is greater than the value one. CPU 0 therefore starts the shift back to a single callback list. It sets ->percpu_enqueue_lim to 1, but CPU 1 has already read the old value of 42. It also gets a grace-period state value from get_state_synchronize_rcu(). o CPU 0 sees that ncbsnz is non-zero in its second "if" statement, so it declines to finalize the shrink operation. o CPU 0 again invokes rcu_tasks_one_gp(), and counts fewer than rcu_task_collapse_lim callbacks. It also sees that there are no callback queued on any other CPU, and thus sets ncbsnz to zero. o CPU 1 resumes execution and enqueues its callback onto its own list. This invalidates the value of ncbsnz. o CPU 0 sees the rcu_task_cb_adjust is non-zero and that the number of callbacks does not exceed rcu_task_collapse_lim. It therefore checks percpu_enqueue_lim, but sees that its value is already unity. It therefore does not get a new grace-period state value. o CPU 0 sees that rcu_task_cb_adjust is non-zero, ncbsnz is zero, and that poll_state_synchronize_rcu() says that the grace period has completed. it therefore finalizes the shrink operation, setting ->percpu_dequeue_lim to the value one. o CPU 0 does a debug check, scanning the other CPUs' callback lists. It sees that CPU 1's list has a callback, so it (rightly) triggers the WARN_ON_ONCE(). After all, the new value of ->percpu_dequeue_lim says to not bother looking at CPU 1's callback list, which means that this callback will never be invoked. This can result in hangs and maybe even OOMs. Based on long experience with rcutorture, this is an extremely low-probability race condition, but it really can happen, especially in preemptible kernels or within guest OSes. This commit therefore checks for completion of the grace period before counting callbacks. With this change, in the above failure scenario CPU 0 would know not to prematurely end the shrink operation because the grace period would not have completed before the count operation started. [ paulmck: Adjust grace-period end rather than adding RCU reader. ] [ paulmck: Avoid spurious WARN_ON_ONCE() with ->percpu_dequeue_lim check. ] Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu-tasks: Make rude RCU-Tasks work well with CPU hotplugZqiang
The synchronize_rcu_tasks_rude() function invokes rcu_tasks_rude_wait_gp() to wait one rude RCU-tasks grace period. The rcu_tasks_rude_wait_gp() function in turn checks if there is only a single online CPU. If so, it will immediately return, because a call to synchronize_rcu_tasks_rude() is by definition a grace period on a single-CPU system. (We could have blocked!) Unfortunately, this check uses num_online_cpus() without synchronization, which can result in too-short grace periods. To see this, consider the following scenario: CPU0 CPU1 (going offline) migration/1 task: cpu_stopper_thread -> take_cpu_down -> _cpu_disable (dec __num_online_cpus) ->cpuhp_invoke_callback preempt_disable access old_data0 task1 del old_data0 ..... synchronize_rcu_tasks_rude() task1 schedule out .... task2 schedule in rcu_tasks_rude_wait_gp() ->__num_online_cpus == 1 ->return .... task1 schedule in ->free old_data0 preempt_enable When CPU1 decrements __num_online_cpus, its value becomes 1. However, CPU1 has not finished going offline, and will take one last trip through the scheduler and the idle loop before it actually stops executing instructions. Because synchronize_rcu_tasks_rude() is mostly used for tracing, and because both the scheduler and the idle loop can be traced, this means that CPU0's prematurely ended grace period might disrupt the tracing on CPU1. Given that this disruption might include CPU1 executing instructions in memory that was just now freed (and maybe reallocated), this is a matter of some concern. This commit therefore removes that problematic single-CPU check from the rcu_tasks_rude_wait_gp() function. This dispenses with the single-CPU optimization, but there is no evidence indicating that this optimization is important. In addition, synchronize_rcu_tasks_generic() contains a similar optimization (albeit only for early boot), which also splats. (As in exactly why are you invoking synchronize_rcu_tasks_rude() so early in boot, anyway???) It is OK for the synchronize_rcu_tasks_rude() function's check to be unsynchronized because the only times that this check can evaluate to true is when there is only a single CPU running with preemption disabled. While in the area, this commit also fixes a minor bug in which a call to synchronize_rcu_tasks_rude() would instead be attributed to synchronize_rcu_tasks(). [ paulmck: Add "synchronize_" prefix and "()" suffix. ] Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()Frederic Weisbecker
RCU Tasks and PID-namespace unshare can interact in do_exit() in a complicated circular dependency: 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace that every subsequent child of TASK A will belong to. But TASK A doesn't itself belong to that new PID namespace. 2) TASK A forks() and creates TASK B. TASK A stays attached to its PID namespace (let's say PID_NS1) and TASK B is the first task belonging to the new PID namespace created by unshare() (let's call it PID_NS2). 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2 child reaper. 4) TASK A forks() again and creates TASK C which get attached to PID_NS2. Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has TASK B (belonging to PID_NS2) as a pid_namespace child_reaper. 5) TASK B exits and since it is the child reaper for PID_NS2, it has to kill all other tasks attached to PID_NS2, and wait for all of them to die before getting reaped itself (zap_pid_ns_process()). 6) TASK A calls synchronize_rcu_tasks() which leads to synchronize_srcu(&tasks_rcu_exit_srcu). 7) TASK B is waiting for TASK C to get reaped. But TASK B is under a tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A. 8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C, but it can't because TASK A waits for TASK B that waits for TASK C. Pid_namespace semantics can hardly be changed at this point. But the coverage of tasks_rcu_exit_srcu can be reduced instead. The current task is assumed not to be concurrently reapable at this stage of exit_notify() and therefore tasks_rcu_exit_srcu can be temporarily relaxed without breaking its constraints, providing a way out of the deadlock scenario. [ paulmck: Fix build failure by adding additional declaration. ] Fixes: 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Eric W . Biederman <ebiederm@xmission.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() callsFrederic Weisbecker
Ever since the following commit: 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()") SRCU doesn't rely anymore on preemption to be disabled in order to modify the per-CPU counter. And even then it used to be done from the API itself. Therefore and after checking further, it appears to be safe to remove the preemption disablement around __srcu_read_[un]lock() in exit_tasks_rcu_start() and exit_tasks_rcu_finish() Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>