Age  |  Commit message  |  Author
2022-07-19  can: sja1000: Add Quirk for RZ/N1 SJA1000 CAN controller  [Biju Das]
As per Chapter 6.5.16 of the RZ/N1 Peripheral Manual, the SJA1000 CAN controller does not support the Clock Divider Register, unlike the reference Philips SJA1000 device. This patch adds a device quirk to handle this difference. Link: https://lore.kernel.org/all/20220710115248.190280-4-biju.das.jz@bp.renesas.com Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
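A minimal sketch of how such a quirk could look in the driver; the quirk flag name, the flags field use, and the placement in the start path are assumptions for illustration, not the actual patch:

  #include <linux/bits.h>
  #include "sja1000.h"

  /* Hypothetical quirk bit; the real driver may use a different name or field. */
  #define SJA1000_NO_CDR_REG_QUIRK        BIT(0)

  static void sja1000_set_clock_divider(struct sja1000_priv *priv)
  {
          /* The RZ/N1 variant has no Clock Divider Register, so skip the CDR write. */
          if (priv->flags & SJA1000_NO_CDR_REG_QUIRK)
                  return;

          priv->write_reg(priv, SJA1000_CDR, priv->cdr);
  }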
2022-07-19  dt-bindings: can: nxp,sja1000: Document RZ/N1{D,S} support  [Biju Das]
Add CAN binding documentation for the Renesas RZ/N1 SoC. The SJA1000 CAN controller on the RZ/N1 SoC has some differences compared to others: it has no clock divider register (CDR) support and no HW loopback (the HW doesn't see tx messages on rx). Introduce a new compatible 'renesas,rzn1-sja1000' to handle these differences. Link: https://lore.kernel.org/all/20220710115248.190280-3-biju.das.jz@bp.renesas.com Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  dt-bindings: can: sja1000: Convert to json-schema  [Biju Das]
Convert the NXP SJA1000 CAN Controller Device Tree binding documentation to json-schema. Update the example to match reality. Link: https://lore.kernel.org/all/20220710115248.190280-2-biju.das.jz@bp.renesas.com Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty  [Zqiang]
Currently, if the 'rcu_nocb_poll' kernel boot parameter is enabled, all rcuog kthreads enter polling mode. However, if all of a given group of rcuo kthreads correspond to CPUs that have been de-offloaded, the corresponding rcuog kthread will nonetheless still wake up periodically, unnecessarily consuming power and perturbing workloads. Fortunately, this situation is easily detected by the fact that the rcuog kthread's CPU's rcu_data structure's ->nocb_head_rdp list is empty. This commit saves power and avoids unnecessarily perturbing workloads by putting an rcuog kthread to sleep during any time period when all of its rcuo kthreads' CPUs are de-offloaded. Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
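A hedged sketch of the resulting logic, heavily simplified from the real nocb GP kthread loop (this lives conceptually in kernel/rcu/tree_nocb.h context):

  /*
   * Simplified sketch, not the actual kernel code: an rcuog kthread only
   * keeps polling while it still has offloaded rdps on its list.
   */
  static bool rcuog_should_poll(struct rcu_data *my_rdp)
  {
          /* All CPUs in this group were de-offloaded: sleep instead of polling. */
          if (list_empty(&my_rdp->nocb_head_rdp))
                  return false;

          return rcu_nocb_poll;
  }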
2022-07-19  rcu/nocb: Add option to opt rcuo kthreads out of RT priority  [Uladzislau Rezki (Sony)]
This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that prevents rcuo kthreads from running at real-time priority, even in kernels built with RCU_BOOST. This capability is important to devices needing low-latency (as in a few milliseconds) response from expedited RCU grace periods, but which are not running a classic real-time workload. On such devices, permitting the rcuo kthreads to run at real-time priority results in unacceptable latencies imposed on the application tasks, which run as SCHED_OTHER. See for example the following trace output: <snip> <...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270 <snip> If that rcuop kthread were permitted to run at real-time SCHED_FIFO priority, it would monopolize its CPU for hundreds of milliseconds while invoking those 34619 RCU callback functions, which would cause an unacceptably long latency spike for many application stacks on Android platforms. However, some existing real-time workloads require that callback invocation run at SCHED_FIFO priority, for example, those running on systems with heavy SCHED_OTHER background loads. (It is the real-time system's administrator's responsibility to make sure that important real-time tasks run at a higher priority than do RCU's kthreads.) Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to "y" on kernels built with PREEMPT_RT and defaults to "n" otherwise. The effect is to preserve current behavior for real-time systems, but for other systems to allow expedited RCU grace periods to run with real-time priority while continuing to invoke RCU callbacks as SCHED_OTHER. As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no effect except on CPUs with offloaded RCU callbacks. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()  [Zqiang]
Callbacks are invoked in RCU kthreads when callbacks are offloaded (rcu_nocbs boot parameter) or when RCU's softirq handler has been offloaded to rcuc kthreads (use_softirq==0). The current code allows for the rcu_nocbs case but not the use_softirq case. This commit adds support for the use_softirq case. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  rcu/nocb: Add an option to offload all CPUs on boot  [Joel Fernandes]
Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have callback offloading on any of the CPUs, nor can any of the CPUs be switched to enable callback offloading at runtime. Although this is intentional, it would be nice to have a way to offload all the CPUs without having to make random bootloaders specify either the rcu_nocbs= or the rcu_nohz_full= kernel-boot parameters. This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that switches the default so as to offload callback processing on all of the CPUs. This default can still be overridden using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Uladzislau Rezki <urezki@gmail.com> (In v4.1, fixed issues with CONFIG maze reported by kernel test robot). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
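A rough sketch of what the new default amounts to at boot; the setup flag shown is a stand-in name, not necessarily the real one:

  /*
   * Sketch within kernel/rcu context; "nocb_is_setup" is a stand-in for
   * whatever flag records that rcu_nocbs= or rcu_nohz_full= was supplied.
   */
  static bool nocb_is_setup;

  static void __init rcu_nocb_default_all_sketch(void)
  {
          if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_DEFAULT_ALL) && !nocb_is_setup) {
                  cpumask_setall(rcu_nocb_mask);  /* offload every CPU...           */
                  nocb_is_setup = true;           /* ...unless overridden at boot   */
          }
  }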
2022-07-19  rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call  [Zqiang]
If spawning of the rcuog/rcuo[p] kthreads fails, the offloaded rdp needs to be explicitly deoffloaded, otherwise the target rdp is still considered offloaded even though nothing actually handles the callbacks. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order  [Zqiang]
In case of failure to spawn either rcuog or rcuo[p] kthreads for a given rdp, rcu_nocb_rdp_deoffload() needs to be called with the hotplug lock and the barrier_mutex held. However, the CPUs write lock is already held while calling rcutree_prepare_cpu(). It's not possible to call rcu_nocb_rdp_deoffload() from there while only taking the barrier_mutex, as this would result in a locking inversion against rcu_nocb_cpu_deoffload(), which holds both locks in the reverse order. Simply solve this by inverting the locking order inside rcu_nocb_cpu_[de]offload(). This will be a prerequisite to toggling NOCB states toward cpusets anyway. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  rcu/nocb: Add/del rdp to iterate from rcuog itself  [Frederic Weisbecker]
NOCB rdps are part of a group whose list is iterated by the corresponding rdp leader. This list is RCU-traversed because an rdp can be added or deleted concurrently. Upon addition, a new iteration of the list after a synchronization point (a pair of LOCK/UNLOCK of ->nocb_gp_lock) is forced to make sure:
1) we didn't miss a new element added in the middle of an iteration,
2) we didn't ignore a whole subset of the list due to an element being quickly deleted and then re-added,
3) we are protected against other, less obvious surprises.
Although this layout is expected to be safe, it doesn't help anybody sleep well. Instead, simplify the nocb state toggling by moving the list modification from the nocb (de-)offloading workqueue to the rcuog kthreads. Whenever the rdp leader is expected to (re-)set the SEGCBLIST_KTHREAD_GP flag of a target rdp, the latter is queued so that the leader handles the flag flip along with adding or deleting the target rdp to the list it iterates. This way the list modification and iteration happen from the same kthread, and those operations cannot race with each other. As a bonus, the flags for each rdp don't need to be checked locklessly before each iteration, which is one less opportunity to produce nightmares. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.19  [Jens Axboe]
Pull MD fix from Song.
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid5: missing error code in setup_conf()
2022-07-19  rcu/tree: Add comment to describe GP-done condition in fqs loop  [Neeraj Upadhyay]
Add a comment to explain why !rcu_preempt_blocked_readers_cgp() condition is required on root rnp node, for GP completion check in rcu_gp_fqs_loop(). Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19  rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()  [Paul E. McKenney]
This commit saves a line of code by initializing the rcu_gp_fqs() function's first_gp_fqs local variable in its declaration. Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19  rcu/kvfree: Remove useless monitor_todo flag  [Joel Fernandes (Google)]
monitor_todo is not needed, as the work struct already tracks whether work is pending. Just use that, via the schedule_delayed_work() helper, to know whether work is pending. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
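The core of the simplification, as a hedged sketch; the delay value and names here are illustrative, not copied from the kvfree code:

  #include <linux/workqueue.h>

  static struct delayed_work monitor_work;   /* stand-in for the kvfree monitor */

  static void kvfree_schedule_monitor_sketch(void)
  {
          /*
           * schedule_delayed_work() returns false and does nothing if the work
           * is already pending, so no separate monitor_todo flag is required.
           */
          schedule_delayed_work(&monitor_work, HZ / 20);
  }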
2022-07-19  rcu: Cleanup RCU urgency state for offline CPU  [Zqiang]
When a CPU is slow to provide a quiescent state for a given grace period, RCU takes steps to encourage that CPU to get with the quiescent-state program in a more timely fashion. These steps include these flags in the rcu_data structure:
1. ->rcu_urgent_qs, which causes the scheduling-clock interrupt to request an otherwise pointless context switch from the scheduler.
2. ->rcu_need_heavy_qs, which causes both cond_resched() and RCU's context-switch hook to do an immediate momentary quiescent state.
3. ->rcu_need_heavy_qs, which causes the scheduler-clock tick to be enabled even on nohz_full CPUs with only one runnable task.
These flags are of course cleared once the corresponding CPU has passed through a quiescent state. Unless that quiescent state is the CPU going offline, which means that when the CPU comes back online, it will needlessly consume additional CPU time and incur additional latency, which constitutes a minor but very real performance bug. This commit therefore adds the call to rcu_disable_urgency_upon_qs() that clears these flags to the CPU-hotplug offlining code path. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  rcu: tiny: Record kvfree_call_rcu() call stack for KASAN  [Johannes Berg]
When running KASAN with Tiny RCU (e.g. under ARCH=um, where a working KASAN patch is now available), we don't get any information on the original kfree_rcu() (or similar) caller when a problem is reported, as Tiny RCU doesn't record this. Add the recording, which required pulling kvfree_call_rcu() out of line for the KASAN case, since the recording function (kasan_record_aux_stack_noalloc) is neither exported, nor can we include kasan.h into rcutiny.h.
Without KASAN, the patch has no size impact (ARCH=um kernel):
    text    data      bss      dec     hex filename
 6151515 4423154 33148520 43723189 29b29b5 linux
 6151515 4423154 33148520 43723189 29b29b5 linux + patch
With KASAN, the impact on my build was minimal:
     text    data      bss      dec     hex filename
 13915539 7388050 33282304 54585893 340ea25 linux
 13911266 7392114 33282304 54585684 340e954 linux + patch
    -4273   +4064      +-0     -209
Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19  locking/csd_lock: Change csdlock_debug from early_param to __setup  [Chen Zhongjin]
The csdlock_debug kernel-boot parameter is parsed by the early_param() function csdlock_debug(). If set, csdlock_debug() invokes static_branch_enable() to enable the csd_lock_wait feature, which triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and CONFIG_SPARSEMEM_VMEMMAP=n. With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in static_key_enable() and returns NULL, resulting in a NULL dereference because mem_section is initialized only later in sparse_init(). This is also a problem for powerpc, because early_param() functions are invoked earlier than jump_label_init(), also resulting in static_key_enable() failures. These failures cause the warning "static key 'xxx' used before call to jump_label_init()". Thus, early_param() is too early for csd_lock_wait to run static_branch_enable(), so change it to __setup() to fix these failures. Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging") Cc: stable@vger.kernel.org Reported-by: Chen jingwen <chenjingwen6@huawei.com> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
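A sketch of the mechanism change; the parsing below is simplified and the static key name is an assumption based on the commit's context, so treat it as illustrative:

  #include <linux/init.h>
  #include <linux/jump_label.h>
  #include <linux/string.h>

  /* Assumed name of the key that gates csd_lock_wait debugging. */
  DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);

  /*
   * __setup() handlers run later in boot, after jump_label_init() and
   * sparse_init(), so static_branch_enable() is safe here, unlike in an
   * early_param() handler.
   */
  static int __init csdlock_debug_sketch(char *str)
  {
          if (str && !strcmp(str, "1"))
                  static_branch_enable(&csdlock_debug_enabled);
          return 1;       /* tell the boot code the option was handled */
  }
  __setup("csdlock_debug=", csdlock_debug_sketch);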
2022-07-19  srcu: Make expedited RCU grace periods block even less frequently  [Neeraj Upadhyay]
The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case. Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per-phase counts on a system with preemption enabled, and observed the following boot times:
+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
+──────────────────────────+────────────────+
| 100                      | 30.053         |
| 150                      | 25.151         |
| 200                      | 20.704         |
| 250                      | 15.748         |
| 500                      | 11.401         |
| 1000                     | 11.443         |
| 10000                    | 11.258         |
| 1000000                  | 11.154         |
+──────────────────────────+────────────────+
Analysis of the experimental results shows additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when the number of per-phase iterations was scaled to one jiffy. This commit therefore scales the per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boot time; for the CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup, and from 2m40s to ~9.7s on Zhangfei's setup. In addition to the changes to the default per-phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the SRCU grace period scanning delays in order to react more quickly to additional use cases. Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: yueluck <yueluck@163.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Tested-by: Marc Zyngier <maz@kernel.org> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19  rcu: Forbid RCU_STRICT_GRACE_PERIOD in TINY_RCU kernels  [Paul E. McKenney]
The RCU_STRICT_GRACE_PERIOD Kconfig option does nothing in kernels built with CONFIG_TINY_RCU=y, so this commit adjusts the dependencies to disallow this combination. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  srcu: Block less aggressively for expedited grace periods  [Paul E. McKenney]
Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") fixed a problem where a long-running expedited SRCU grace period could block kernel live patching. It did so by giving up on expediting once a given SRCU expedited grace period grew too old. Unfortunately, this added excessive delays to boots of virtual embedded systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore makes the transition away from expediting less aggressive, increasing the per-grace-period phase number of non-sleeping polls of readers from one to three and increasing the required grace-period age from one jiffy (actually from zero to one jiffies) to two jiffies (actually from one to two jiffies). Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: "chenxiang (M)" <chenxiang66@hisilicon.com> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
2022-07-19  rcu: Immediately boost preempted readers for strict grace periods  [Zqiang]
The intent of the CONFIG_RCU_STRICT_GRACE_PERIOD Kconfig option is to cause normal grace periods to complete quickly in order to better catch errors resulting from improperly leaking pointers from RCU read-side critical sections. However, kernels built with this option enabled still wait for some hundreds of milliseconds before boosting RCU readers that have been preempted within their current critical section. The value of this delay is set by the CONFIG_RCU_BOOST_DELAY Kconfig option, which defaults to 500 milliseconds. This commit therefore causes kernels built with strict grace periods to ignore CONFIG_RCU_BOOST_DELAY. This causes rcu_initiate_boost() to start boosting immediately after all CPUs on a given leaf rcu_node structure have passed through their quiescent states. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
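In effect, the boost-delay selection becomes something like this hedged sketch (the helper name is illustrative; the real change lives around rcu_initiate_boost()):

  #include <linux/jiffies.h>

  /* Sketch: strict-grace-period kernels boost preempted readers immediately. */
  static unsigned long rcu_boost_delay_jiffies(void)
  {
          if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
                  return 0;

          return msecs_to_jiffies(CONFIG_RCU_BOOST_DELAY);   /* defaults to 500 ms */
  }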
2022-07-19  rcu: Add rnp->cbovldmask check in rcutree_migrate_callbacks()  [Zqiang]
Currently, the rcu_node structure's ->cbovldmask field is set in call_rcu() when a given CPU is suffering from callback overload. But if that CPU goes offline, the outgoing CPU's callbacks are migrated to the running CPU, which is likely to overload the running CPU. However, that CPU's bit in its leaf rcu_node structure's ->cbovldmask field remains zero. Initially, this is OK because the outgoing CPU's bit remains set. However, that bit will be cleared at the next end of a grace period, at which time it is quite possible that the running CPU will still be overloaded. If the running CPU invokes call_rcu(), then overload will be checked for and the bit will be set. Except that there is no guarantee that the running CPU will invoke call_rcu(), in which case the next grace period will fail to take the running CPU's overload condition into account. Plus, because the bit is not set, the end of the grace period won't check for overload on this CPU. This commit therefore adds a call to check_cb_ovld_locked() in rcutree_migrate_callbacks() to set the running CPU's ->cbovldmask bit appropriately. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
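A simplified sketch of where the new check slots in; the surrounding migration logic is elided and the wrapper function is illustrative (kernel/rcu/tree.c context):

  /*
   * Sketch only: after the outgoing CPU's callbacks have been adopted,
   * re-evaluate whether the adopting CPU is now overloaded so that its bit
   * in the leaf rcu_node's ->cbovldmask is set (or cleared) appropriately.
   */
  static void note_migrated_cb_overload(struct rcu_data *my_rdp,
                                        struct rcu_node *my_rnp)
  {
          raw_spin_lock_rcu_node(my_rnp);
          check_cb_ovld_locked(my_rdp, my_rnp);
          raw_spin_unlock_rcu_node(my_rnp);
  }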
2022-07-19  rcu: Avoid tracing a few functions executed in stop machine  [Patrick Wang]
Stop-machine recently started calling additional functions while waiting:
----------------------------------------------------------------
Former stop machine wait loop:
do {
    cpu_relax();                    => macro
    ...
} while (curstate != STOPMACHINE_EXIT);
-----------------------------------------------------------------
Current stop machine wait loop:
do {
    stop_machine_yield(cpumask);    => function (notraced)
    ...
    touch_nmi_watchdog();           => function (notraced, inside calls also notraced)
    ...
    rcu_momentary_dyntick_idle();   => function (notraced, inside calls traced)
} while (curstate != MULTI_STOP_EXIT);
------------------------------------------------------------------
These functions (and the functions that they call) must be marked notrace to prevent them from being updated while they are executing. The consequences of failing to mark these functions can be severe:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:    1-...!: (0 ticks this GP) idle=14f/1/0x4000000000000000 softirq=3397/3397 fqs=0
rcu:    3-...!: (0 ticks this GP) idle=ee9/1/0x4000000000000000 softirq=5168/5168 fqs=0
        (detected by 0, t=8137 jiffies, g=5889, q=2 ncpus=4)
Task dump for CPU 1:
task:migration/1     state:R  running task
stack:    0 pid:   19 ppid:     2 flags:0x00000000
Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
Call Trace:
Task dump for CPU 3:
task:migration/3     state:R  running task
stack:    0 pid:   29 ppid:     2 flags:0x00000000
Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
Call Trace:
rcu: rcu_preempt kthread timer wakeup didn't happen for 8136 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu:    Possible timer handling issue on cpu=2 timer-softirq=594
rcu: rcu_preempt kthread starved for 8137 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
rcu:    Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:I stack:    0 pid:   14 ppid:     2 flags:0x00000000
Call Trace:
  schedule+0x56/0xc2
  schedule_timeout+0x82/0x184
  rcu_gp_fqs_loop+0x19a/0x318
  rcu_gp_kthread+0x11a/0x140
  kthread+0xee/0x118
  ret_from_exception+0x0/0x14
rcu: Stack dump where RCU GP kthread last ran:
Task dump for CPU 2:
task:migration/2     state:R  running task
stack:    0 pid:   24 ppid:     2 flags:0x00000000
Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
Call Trace:
This commit therefore marks these functions notrace:
  rcu_preempt_deferred_qs()
  rcu_preempt_need_deferred_qs()
  rcu_preempt_deferred_qs_irqrestore()
[ paulmck: Apply feedback from Neeraj Upadhyay. ] Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19  rcu: Decrease FQS scan wait time in case of callback overloading  [Paul E. McKenney]
The force-quiesce-state loop function rcu_gp_fqs_loop() checks for callback overloading and does an immediate initial scan for idle CPUs if so. However, subsequent rescans will be carried out at as leisurely a rate as they always are, as specified by the rcutree.jiffies_till_next_fqs module parameter. It might be tempting to just continue immediately rescanning, but this turns the RCU grace-period kthread into a CPU hog. It might also be tempting to reduce the time between rescans to a single jiffy, but this can be problematic on larger systems. This commit therefore divides the normal time between rescans by three, rounding up. Thus a small system running at HZ=1000 that is suffering from callback overload will wait only one jiffy instead of the normal three between rescans. [ paulmck: Apply Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
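The arithmetic described above, as a small hedged sketch; the helper is illustrative and the real change lives in rcu_gp_fqs_loop():

  #include <linux/math.h>

  /* Sketch: shorten the FQS rescan interval while callbacks are overloaded. */
  static unsigned long fqs_rescan_jiffies(unsigned long j, bool cbovld)
  {
          if (cbovld)
                  j = DIV_ROUND_UP(j, 3);   /* HZ=1000: 3 jiffies becomes 1 jiffy */

          return j;
  }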
2022-07-19  Merge branch 'can-slcan-checkpatch-cleanups'  [Marc Kleine-Budde]
Marc Kleine-Budde says: ==================== can: slcan: checkpatch cleanups This is a patch series consisting of various checkpatch cleanups for the slcan driver. ==================== Link: https://lore.kernel.org/all/20220704125954.1587880-1-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  can: slcan: clean up if/else  [Marc Kleine-Budde]
Remove braces after if() for single-statement blocks; also remove the else after a return in an if() block. Link: https://lore.kernel.org/all/20220704125954.1587880-6-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  can: slcan: convert comparison to NULL into !val  [Marc Kleine-Budde]
All comparisons to NULL can be written as "!val"; convert them to make checkpatch happy. Link: https://lore.kernel.org/all/20220704125954.1587880-5-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
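For illustration, the kind of conversion checkpatch asks for; the variable and function here are just examples, not taken from the slcan driver:

  #include <linux/skbuff.h>

  static void example_free(struct sk_buff *skb)
  {
          if (!skb)               /* was: if (skb == NULL) */
                  return;

          kfree_skb(skb);
  }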
2022-07-19  can: slcan: fix whitespace issues  [Marc Kleine-Budde]
Add and remove whitespace to make checkpatch happy. Link: https://lore.kernel.org/all/20220704125954.1587880-4-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  can: slcan: slcan_init() convert printk(LEVEL ...) to pr_level()  [Marc Kleine-Budde]
Convert the last printk(LEVEL ...) to pr_level(). Link: https://lore.kernel.org/all/20220704125954.1587880-3-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  can: slcan: convert comments to network style comments  [Marc Kleine-Budde]
Convert all comments to network subsystem style comments. Link: https://lore.kernel.org/all/20220704125954.1587880-2-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-07-19  can: slcan: use scnprintf() as a hardening measure  [Dan Carpenter]
The snprintf() function returns the number of bytes which *would* have been copied if there were enough space. Since this code does not check the return value, if the buffer was not large enough there would be a buffer overflow two lines later when it does: actual = sl->tty->ops->write(sl->tty, sl->xbuff, n); Use scnprintf() instead, because that returns the number of bytes which were actually copied. Fixes: 52f9ac85b876 ("can: slcan: allow to send commands to the adapter") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/all/YsVA9KoY/ZSvNGYk@kili Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
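A hedged sketch of the fix pattern; the buffer and command names are illustrative of the slcan code, not copied from it:

  #include <linux/kernel.h>

  static int slcan_format_cmd_sketch(char *xbuff, size_t len, const char *cmd)
  {
          /*
           * scnprintf() returns the number of bytes actually written (excluding
           * the trailing NUL), never more than len - 1, so the result is safe
           * to pass on to tty->ops->write() as a length.
           */
          return scnprintf(xbuff, len, "%s\r", cmd);
  }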
2022-07-19  arm64/mm: use GENMASK_ULL for TTBR_BADDR_MASK_52  [Joey Gouly]
The comment says this should be GENMASK_ULL(47, 12), so do that! GENMASK_ULL() is available in assembly since: 95b980d62d52 ("linux/bits.h: make BIT(), GENMASK(), and friends available in assembly") Signed-off-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/all/20171221164851.edxq536yobjuagwe@armageddon.cambridge.arm.com/ Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Kristina Martsenko <kristina.martsenko@arm.com> Reviewed-by: Kristina Martsenko <kristina.martsenko@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220708140056.10123-1-joey.gouly@arm.com Signed-off-by: Will Deacon <will@kernel.org>
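For reference, the definition described by the patch boils down to the following; the value is taken from the commit text itself:

  #include <linux/bits.h>

  /* Valid TTBR base-address bits when 52-bit PAs are in use, per the code comment: [47:12]. */
  #define TTBR_BADDR_MASK_52      GENMASK_ULL(47, 12)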
2022-07-19  arm64: errata: Remove AES hwcap for COMPAT tasks  [James Morse]
Cortex-A57 and Cortex-A72 have an erratum where an interrupt that occurs between a pair of AES instructions in aarch32 mode may corrupt the ELR. The task will subsequently produce the wrong AES result. The AES instructions are part of the cryptographic extensions, which are optional. User-space software will detect the support for these instructions from the hwcaps. If the platform doesn't support these instructions a software implementation should be used. Remove the hwcap bits on affected parts to indicate user-space should not use the AES instructions. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20220714161523.279570-3-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-19  arm64: numa: Don't check node against MAX_NUMNODES  [Gavin Shan]
When the NUMA nodes are sorted by checking the ACPI SRAT (GICC AFFINITY) sub-table, it's impossible for acpi_map_pxm_to_node() to return any value that is greater than or equal to MAX_NUMNODES. Let's drop the unnecessary check in acpi_numa_gicc_affinity_init(). No functional change intended. Signed-off-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20220718064232.3464373-1-gshan@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-19  KVM: x86: Protect the unused bits in MSR exiting flags  [Aaron Lewis]
The flags for KVM_CAP_X86_USER_SPACE_MSR and KVM_X86_SET_MSR_FILTER have no protection for their unused bits. Without protection, future development for these features will be difficult. Add the protection needed to make it possible to extend these features in the future. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220714161314.1715227-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-19  md/raid5: missing error code in setup_conf()  [Dan Carpenter]
Return -ENOMEM if the allocation fails. Don't return success. Fixes: 8fbcba6b999b ("md/raid5: Cleanup setup_conf() error returns") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-07-19  drivers/perf: arm_spe: Fix consistency of SYS_PMSCR_EL1.CX  [Anshuman Khandual]
The arm_spe_pmu driver will enable SYS_PMSCR_EL1.CX in order to add CONTEXT packets into the traces, if the owner of the perf event runs with required capabilities i.e CAP_PERFMON or CAP_SYS_ADMIN via perfmon_capable() helper. The value of this bit is computed in the arm_spe_event_to_pmscr() function but the check for capabilities happens in the pmu event init callback i.e arm_spe_pmu_event_init(). This suggests that the value of the CX bit should remain consistent for the duration of the perf session. However, the function arm_spe_event_to_pmscr() may be called later during the event start callback i.e arm_spe_pmu_start() when the "current" process is not the owner of the perf session, hence the CX bit setting is currently not consistent. One way to fix this, is by caching the required value of the CX bit during the initialization of the PMU event, so that it remains consistent for the duration of the session. It uses currently unused 'event->hw.flags' element to cache perfmon_capable() value, which can be referred during event start callback to compute SYS_PMSCR_EL1.CX. This ensures consistent availability of context packets in the trace as per event owner capabilities. Drop BIT(SYS_PMSCR_EL1_CX_SHIFT) check in arm_spe_pmu_event_init(), because now CX bit cannot be set in arm_spe_event_to_pmscr() with perfmon_capable() disabled. Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension") Reported-by: German Gomez <german.gomez@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220714061302.2715102-1-anshuman.khandual@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-19  libbpf: fix an snprintf() overflow check  [Dan Carpenter]
The snprintf() function returns the number of bytes it *would* have copied if there were enough space, so it can return more than sizeof(gen->attach_target). Fixes: 67234743736a ("libbpf: Generate loader program out of BPF ELF file.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/YtZ+oAySqIhFl6/J@kili Signed-off-by: Alexei Starovoitov <ast@kernel.org>
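The correct check pattern, sketched with illustrative names (the real code writes into gen->attach_target):

  #include <errno.h>
  #include <stdio.h>

  static int set_attach_target_sketch(char *dst, size_t dst_sz, const char *name)
  {
          int n = snprintf(dst, dst_sz, "%s", name);

          /* n is what *would* have been written, so n >= dst_sz means truncation. */
          if (n < 0 || (size_t)n >= dst_sz)
                  return -ENAMETOOLONG;

          return 0;
  }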
2022-07-19  regulator: core: Fix off-on-delay-us for always-on/boot-on regulators  [Christian Kohlschütter]
Regulators marked with "regulator-always-on" or "regulator-boot-on" as well as an "off-on-delay-us", may run into cycling issues that are hard to detect. This is caused by the "last_off" state not being initialized in this case. Fix the "last_off" initialization by setting it to the current kernel time upon initialization, regardless of always_on/boot_on state. Signed-off-by: Christian Kohlschütter <christian@kohlschutter.com> Link: https://lore.kernel.org/r/FAFD5B39-E9C4-47C7-ACF1-2A04CD59758D@kohlschutter.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-07-19  selftests/bpf: fix a test for snprintf() overflow  [Dan Carpenter]
The snprintf() function returns the number of bytes which *would* have been copied if there were space. In other words, it can be > sizeof(pin_path). Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/YtZ+aD/tZMkgOUw+@kili Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19  perf: RISC-V: Add of_node_put() when breaking out of for_each_of_cpu_node()  [Liang He]
In pmu_sbi_setup_irqs(), we should call of_node_put() for 'cpu' when breaking out of for_each_of_cpu_node(), as its refcount is automatically increased and decreased during the iteration. Fixes: 4905ec2fb7e6 ("RISC-V: Add sscofpmf extension support") Signed-off-by: Liang He <windhl@126.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20220715130330.443363-1-windhl@126.com Signed-off-by: Will Deacon <will@kernel.org>
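The general pattern the fix follows, as a hedged sketch; the property check below is just a placeholder error condition, not the driver's actual logic:

  #include <linux/errno.h>
  #include <linux/of.h>

  static int scan_cpu_nodes_sketch(void)
  {
          struct device_node *cpu;

          for_each_of_cpu_node(cpu) {
                  if (!of_property_read_bool(cpu, "interrupts")) {
                          /* The iterator holds a reference on 'cpu'; drop it before leaving. */
                          of_node_put(cpu);
                          return -ENODEV;
                  }
          }

          return 0;
  }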
2022-07-19  bpf, docs: document BPF_MAP_TYPE_HASH and variants  [Donald Hunter]
Add documentation for BPF_MAP_TYPE_HASH including kernel version introduced, usage and examples. Document BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_LRU_HASH and BPF_MAP_TYPE_LRU_PERCPU_HASH variations. Note that this file is included in the BPF documentation by the glob in Documentation/bpf/maps.rst v3: Fix typos reported by Stanislav Fomichev and Yonghong Song. Add note about iteration and deletion as requested by Yonghong Song. v2: Describe memory allocation semantics as suggested by Stanislav Fomichev. Fix u64 typo reported by Stanislav Fomichev. Cut down usage examples to only show usage in context. Updated patch description to follow style recommendation, reported by Bagas Sanjaya. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220718125847.1390-1-donald.hunter@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19  intel_idle: Add a new flag to initialize the AMX state  [Chang S. Bae]
The non-initialized AMX state can be the cause of C-state demotion from C6 to C1E. This low-power idle state may improve power savings and thus result in a higher available turbo frequency budget. This behavior is implementation-specific. Initialize the state for the C6 entrance of Sapphire Rapids as needed. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lkml.kernel.org/r/20220614164116.5196-1-chang.seok.bae@intel.com
2022-07-19  selftests/bpf: test eager BPF ringbuf size adjustment logic  [Andrii Nakryiko]
Add test validating that libbpf adjusts (and reflects adjusted) ringbuf size early, before bpf_object is loaded. Also make sure we can't successfully resize ringbuf map after bpf_object is loaded. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220715230952.2219271-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19  libbpf: make RINGBUF map size adjustments more eagerly  [Andrii Nakryiko]
Make libbpf adjust RINGBUF map size (rounding it up to closest power-of-2 of page_size) more eagerly: during open phase when initializing the map and on explicit calls to bpf_map__set_max_entries(). Such approach allows user to check actual size of BPF ringbuf even before it's created in the kernel, but also it prevents various edge case scenarios where BPF ringbuf size can get out of sync with what it would be in kernel. One of them (reported in [0]) is during an attempt to pin/reuse BPF ringbuf. Move adjust_ringbuf_sz() helper closer to its first actual use. The implementation of the helper is unchanged. Also make detection of whether bpf_object is already loaded more robust by checking obj->loaded explicitly, given that map->fd can be < 0 even if bpf_object is already loaded due to ability to disable map creation with bpf_map__set_autocreate(map, false). [0] Closes: https://github.com/libbpf/libbpf/pull/530 Fixes: 0087a681fa8c ("libbpf: Automatically fix up BPF_MAP_TYPE_RINGBUF size, if necessary") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220715230952.2219271-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
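The rounding rule described above, as a hedged user-space sketch; libbpf's real helper is adjust_ringbuf_sz(), and this stand-alone version only mirrors the math:

  #include <stdint.h>
  #include <unistd.h>

  /* Round sz up to the smallest power-of-2 multiple of the page size that is >= sz. */
  static uint32_t ringbuf_size_sketch(uint32_t sz)
  {
          uint32_t page_sz = (uint32_t)sysconf(_SC_PAGESIZE);
          uint64_t adj = page_sz;

          while (adj < sz)
                  adj <<= 1;      /* page_sz is a power of 2, so adj stays one too */

          return (uint32_t)adj;
  }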
2022-07-19  bpf: fix bpf_skb_pull_data documentation  [Joanne Koong]
Fix documentation for bpf_skb_pull_data() helper for when len == 0. Fixes: fa15601ab31e ("bpf: add documentation for eBPF helpers (33-41)") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220715193800.3940070-1-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19  libbpf: fallback to tracefs mount point if debugfs is not mounted  [Andrii Nakryiko]
Teach libbpf to fallback to tracefs mount point (/sys/kernel/tracing) if debugfs (/sys/kernel/debug/tracing) isn't mounted. Acked-by: Yonghong Song <yhs@fb.com> Suggested-by: Connor O'Brien <connoro@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220715185736.898848-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19  bpf: Don't redirect packets with invalid pkt_len  [Zhengchao Shao]
Syzbot found an issue [1]: fq_codel_drop() tries to drop a flow without any skbs, that is, the flow->head is null. The root cause, as [2] explains, is that bpf_prog_test_run_skb() runs a bpf prog which redirects empty skbs. So we should determine whether the length of the packet modified by the bpf prog (or others, like bpf_prog_test) is valid before forwarding it directly. LINK: [1] https://syzkaller.appspot.com/bug?id=0b84da80c2917757915afa89f7738a9d16ec96c5 LINK: [2] https://www.spinics.net/lists/netdev/msg777503.html Reported-by: syzbot+7a12909485b94426aceb@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220715115559.139691-1-shaozhengchao@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
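A hedged sketch of the kind of guard the commit adds before handing the packet on; the placement and errno value are illustrative, not the exact patch:

  #include <linux/errno.h>
  #include <linux/netdevice.h>
  #include <linux/skbuff.h>

  static int xmit_checked_sketch(struct sk_buff *skb)
  {
          /* Refuse to forward an skb whose total data length is zero. */
          if (unlikely(!skb->len)) {
                  kfree_skb(skb);
                  return -EINVAL;
          }

          return dev_queue_xmit(skb);
  }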
2022-07-19  x86/fpu: Add a helper to prepare AMX state for low-power CPU idle  [Chang S. Bae]
When a CPU enters an idle state, a non-initialized AMX register state may be the cause of preventing a deeper low-power state. Other extended register states whether initialized or not do not impact the CPU idle state. The new helper can ensure the AMX state is initialized before the CPU is idle, and it will be used by the intel idle driver. Check the AMX_TILE feature bit before using XGETBV1 as a chain of dependencies was established via cpuid_deps[]: AMX->XFD->XGETBV1. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20220608164748.11864-2-chang.seok.bae@intel.com
2022-07-19  Merge branch 'BPF array map fixes and improvements'  [Alexei Starovoitov]
Andrii Nakryiko says: ==================== Fix 32-bit overflow in value pointer calculations in BPF array map. And then raise obsolete limit on array map value size. Add selftest making sure this is working as intended. v1->v2: - fix broken patch #1 (no mask_index use in helper, as stated in commit message; and add missing semicolon). ==================== Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>