summaryrefslogtreecommitdiff
path: root/tools
AgeCommit message (Collapse)Author
2024-06-19selftests: add selftest for the SRv6 End.DX4 behavior with netfilterJianguo Wu
this selftest is designed for evaluating the SRv6 End.DX4 behavior used with netfilter(rpfilter), in this example, for implementing IPv4 L3 VPN use cases. Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-19selftests: openvswitch: Set value to nla flags.Adrian Moreno
Netlink flags, although they don't have payload at the netlink level, are represented as having "True" as value in pyroute2. Without it, trying to add a flow with a flag-type action (e.g: pop_vlan) fails with the following traceback: Traceback (most recent call last): File "[...]/ovs-dpctl.py", line 2498, in <module> sys.exit(main(sys.argv)) ^^^^^^^^^^^^^^ File "[...]/ovs-dpctl.py", line 2487, in main ovsflow.add_flow(rep["dpifindex"], flow) File "[...]/ovs-dpctl.py", line 2136, in add_flow reply = self.nlm_request( ^^^^^^^^^^^^^^^^^ File "[...]/pyroute2/netlink/nlsocket.py", line 822, in nlm_request return tuple(self._genlm_request(*argv, **kwarg)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "[...]/pyroute2/netlink/generic/__init__.py", line 126, in nlm_request return tuple(super().nlm_request(*argv, **kwarg)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "[...]/pyroute2/netlink/nlsocket.py", line 1124, in nlm_request self.put(msg, msg_type, msg_flags, msg_seq=msg_seq) File "[...]/pyroute2/netlink/nlsocket.py", line 389, in put self.sendto_gate(msg, addr) File "[...]/pyroute2/netlink/nlsocket.py", line 1056, in sendto_gate msg.encode() File "[...]/pyroute2/netlink/__init__.py", line 1245, in encode offset = self.encode_nlas(offset) ^^^^^^^^^^^^^^^^^^^^^^^^ File "[...]/pyroute2/netlink/__init__.py", line 1560, in encode_nlas nla_instance.setvalue(cell[1]) File "[...]/pyroute2/netlink/__init__.py", line 1265, in setvalue nlv.setvalue(nla_tuple[1]) ~~~~~~~~~^^^ IndexError: list index out of range Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-18Merge tag 'linux_kselftest-fixes-6.10-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: - filesystems: warn_unused_result warnings - seccomp: format-zero-length warnings - fchmodat2: clang build warnings due to-static-libasan - openat2: clang build warnings due to static-libasan, LOCAL_HDRS * tag 'linux_kselftest-fixes-6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests/fchmodat2: fix clang build failure due to -static-libasan selftests/openat2: fix clang build failures: -static-libasan, LOCAL_HDRS selftests: seccomp: fix format-zero-length warnings selftests: filesystems: fix warn_unused_result build warnings
2024-06-18selftests: openvswitch: Use bash as interpreterSimon Horman
openvswitch.sh makes use of substitutions of the form ${ns:0:1}, to obtain the first character of $ns. Empirically, this is works with bash but not dash. When run with dash these evaluate to an empty string and printing an error to stdout. # dash -c 'ns=client; echo "${ns:0:1}"' 2>error # cat error dash: 1: Bad substitution # bash -c 'ns=client; echo "${ns:0:1}"' 2>error c # cat error This leads to tests that neither pass nor fail. F.e. TEST: arp_ping [START] adding sandbox 'test_arp_ping' Adding DP/Bridge IF: sbx:test_arp_ping dp:arpping {, , } create namespaces ./openvswitch.sh: 282: eval: Bad substitution TEST: ct_connect_v4 [START] adding sandbox 'test_ct_connect_v4' Adding DP/Bridge IF: sbx:test_ct_connect_v4 dp:ct4 {, , } ./openvswitch.sh: 322: eval: Bad substitution create namespaces Resolve this by making openvswitch.sh a bash script. Fixes: 918423fda910 ("selftests: openvswitch: add an initial flow programming case") Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20240617-ovs-selftest-bash-v1-1-7ae6ccd3617b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18sched_ext: Add selftestsDavid Vernet
Add basic selftests. Signed-off-by: David Vernet <dvernet@meta.com> Acked-by: Tejun Heo <tj@kernel.org>
2024-06-18sched_ext: Documentation: scheduler: Document extensible scheduler classTejun Heo
Add Documentation/scheduler/sched-ext.rst which gives a high-level overview and pointers to the examples. v6: - Add paragraph explaining debug dump. v5: - Updated to reflect /sys/kernel interface change. Kconfig options added. v4: - README improved, reformatted in markdown and renamed to README.md. v3: - Added tools/sched_ext/README. - Dropped _example prefix from scheduler names. v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all of them are addressed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com>
2024-06-18sched_ext: Add vtime-ordered priority queue to dispatch_q'sTejun Heo
Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq's are used for simple staging areas for tasks which are ready to execute, it'd make dsq's a lot more useful if they can implement custom ordering. This patch adds a vtime-ordered priority queue to dsq's. When the BPF scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it can specify the vtime tha the task should be inserted at and the task is inserted into the priority queue in the dsq which is ordered according to time_before64() comparison of the vtime values. A DSQ can either be a FIFO or priority queue and automatically switches between the two depending on whether scx_bpf_dispatch() or scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ already has the other type queued is not allowed and triggers an ops error. Built-in DSQs must always be FIFOs. This makes it very easy for the BPF schedulers to implement proper vtime based scheduling within each dsq very easy and efficient at a negligible cost in terms of code complexity and overhead. scx_simple and scx_example_flatcg are updated to default to weighted vtime scheduling (the latter within each cgroup). FIFO scheduling can be selected with -f option. v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes led to unexpected starvations, DSQs now error out if both modes are used at the same time and the built-in DSQs are no longer allowed to be priority queues. - Explicit type struct scx_dsq_node added to contain fields needed to be linked on DSQs. This will be used to implement stateful iterator. - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or PRIQ mode. This confines PRIQ related complexities to the enqueue and dequeue paths. Other paths only need to look at dsq->list. This will also ease implementing BPF iterator. - Print p->scx.dsq_flags in debug dump. v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own p->scx.dsq_flags. The flag is protected with the dsq lock unlike other flags in p->scx.flags. This led to flag corruption in some cases. - Add comments explaining the interaction between using consumption of p->scx.slice to determine vtime progress and yielding. v2: - p->scx.dsq_vtime was not initialized on load or across cgroup migrations leading to some tasks being stalled for extended period of time depending on how saturated the machine is. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18sched_ext: Implement core-sched supportTejun Heo
The core-sched support is composed of the following parts: - task_struct->scx.core_sched_at is added. This is a timestamp which can be used to order tasks. Depending on whether the BPF scheduler implements custom ordering, it tracks either global FIFO ordering of all tasks or local-DSQ ordering within the dispatched tasks on a CPU. - prio_less() is updated to call scx_prio_less() when comparing SCX tasks. scx_prio_less() calls ops.core_sched_before() if available or uses the core_sched_at timestamp. For global FIFO ordering, the BPF scheduler doesn't need to do anything. Otherwise, it should implement ops.core_sched_before() which reflects the ordering. - When core-sched is enabled, balance_scx() balances all SMT siblings so that they all have tasks dispatched if necessary before pick_task_scx() is called. pick_task_scx() picks between the current task and the first dispatched task on the local DSQ based on availability and the core_sched_at timestamps. Note that FIFO ordering is expected among the already dispatched tasks whether running or on the local DSQ, so this path always compares core_sched_at instead of calling into ops.core_sched_before(). qmap_core_sched_before() is added to scx_qmap. It scales the distances from the heads of the queues to compare the tasks across different priority queues and seems to behave as expected. v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi. v2: Sched core added the const qualifiers to prio_less task arguments. Explicitly drop them for ops.core_sched_before() task arguments. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Josh Don <joshdon@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-06-18sched_ext: Implement sched_ext_ops.cpu_online/offline()Tejun Heo
Add ops.cpu_online/offline() which are invoked when CPUs come online and offline respectively. As the enqueue path already automatically bypasses tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed to see tasks only on CPUs which are between online() and offline(). If the BPF scheduler doesn't implement ops.cpu_online/offline(), the scheduler is automatically exited with SCX_ECODE_RESTART | SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support trivially by simply reinitializing and reloading the scheduler. scx_qmap is updated to print out online CPUs on hotplug events. Other schedulers are updated to restart based on ecode. v3: - The previous implementation added @reason to sched_class.rq_on/offline() to distinguish between CPU hotplug events and topology updates. This was buggy and fragile as the methods are skipped if the current state equals the target state. Instead, add scx_rq_[de]activate() which are directly called from sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to sleep which can be useful. - ops.dispatch() could be called on a CPU that the BPF scheduler was told to be offline. The dispatch patch is updated to bypass in such cases. v2: - To accommodate lock ordering change between scx_cgroup_rwsem and cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI block and enabled eariler during scx_ope_enable() so that cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem. - Auto exit with ECODE added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched_ext: Implement sched_ext_ops.cpu_acquire/release()David Vernet
Scheduler classes are strictly ordered and when a higher priority class has tasks to run, the lower priority ones lose access to the CPU. Being able to monitor and act on these events are necessary for use cases includling strict core-scheduling and latency management. This patch adds two operations ops.cpu_acquire() and .cpu_release(). The former is invoked when a CPU becomes available to the BPF scheduler and the opposite for the latter. This patch also implements scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger requeueing of all tasks in the local dsq of the CPU so that the tasks can be reassigned to other available CPUs. scx_pair is updated to use .cpu_acquire/release() along with %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU is preempted by a higher priority scheduler class. scx_qmap is updated to use .cpu_acquire/release() to empty the local dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers that want to have a tight control over latency. v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing. v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local() from ops.cpu_release() nested inside ops.init() and other sleepable operations. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched_ext: Implement tickless supportTejun Heo
Allow BPF schedulers to indicate tickless operation by setting p->scx.slice to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into tickless operation. scx_central is updated to use tickless operations for all tasks and instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT and task state tracking added by the previous patches. Currently, there is no way to pin the timer on the central CPU, so it may end up on one of the worker CPUs; however, outside of that, the worker CPUs can go tickless both while running sched_ext tasks and idling. With schbench running, scx_central shows: root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 142024 656 664 449 Local timer interrupts LOC: 161663 663 665 449 Local timer interrupts Without it: root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 188778 3142 3793 3993 Local timer interrupts LOC: 198993 5314 6323 6438 Local timer interrupts While scx_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting. v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available. v3: Pin the central scheduler's timer on the central_cpu using BPF_F_TIMER_CPU_PIN. v2: Convert to BPF inline iterators. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched_ext: Make watchdog handle ops.dispatch() looping stallTejun Heo
The dispatch path retries if the local DSQ is still empty after ops.dispatch() either dispatched or consumed a task. This is both out of necessity and for convenience. It has to retry because the dispatch path might lose the tasks to dequeue while the rq lock is released while trying to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch() implementation easier as it only needs to make some forward progress each iteration. However, this makes it possible for ops.dispatch() to stall CPUs by repeatedly dispatching ineligible tasks. If all CPUs are stalled that way, the watchdog or sysrq handler can't run and the system can't be saved. Let's address the issue by breaking out of the dispatch loop after 32 iterations. It is unlikely but not impossible for ops.dispatch() to legitimately go over the iteration limit. We want to come back to the dispatch path in such cases as not doing so risks stalling the CPU by idling with runnable tasks pending. As the previous task is still current in balance_scx(), resched_curr() doesn't do anything - it will just get cleared. Let's instead use scx_kick_bpf() which will trigger reschedule after switching to the next task which will likely be the idle task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18sched_ext: Add a central scheduler which makes all scheduling decisions on ↵Tejun Heo
one CPU This patch adds a new example scheduler, scx_central, which demonstrates central scheduling where one CPU is responsible for making all scheduling decisions in the system using scx_bpf_kick_cpu(). The central CPU makes scheduling decisions for all CPUs in the system, queues tasks on the appropriate local dsq's and preempts the worker CPUs. The worker CPUs in turn preempt the central CPU when it needs tasks to run. Currently, every CPU depends on its own tick to expire the current task. A follow-up patch implementing tickless support for sched_ext will allow the worker CPUs to go full tickless so that they can run completely undisturbed. v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch buffer if too many are dispatched to the fallback DSQ. - Use the new SCX_KICK_IDLE to wake up non-central CPUs. - Dropped '-p' option. v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Julia Lawall <julia.lawall@inria.fr>
2024-06-18sched_ext: Implement scx_bpf_kick_cpu() and task preemption supportTejun Heo
It's often useful to wake up and/or trigger reschedule on other CPUs. This patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to kick the target CPU into the scheduling path. As a sched_ext task relinquishes its CPU only after its slice is depleted, this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the slice of the target CPU's current task to guarantee that sched_ext's scheduling path runs on the CPU. If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle to guarantee that the target CPU will go through at least one full sched_ext scheduling cycle after the kicking. This can be used to wake up idle CPUs without incurring unnecessary overhead if it isn't currently idle. As a demonstration of how backward compatibility can be supported using BPF CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides __COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or becomes a regular kicking otherwise. This allows schedulers to use the new SCX_KICK_IDLE while maintaining support for older kernels. The plan is to temporarily use compat helpers to ease API updates and drop them after a few kernel releases. v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for schedulers so that they can support kernels without SCX_KICK_IDLE. This is useful as a demonstration of how new feature flags can be added in a backward compatible way. - kick_cpus_irq_workfn() reimplemented so that it touches the pending cpumasks only as necessary to reduce kicking overhead on machines with a lot of CPUs. - tools/sched_ext/include/scx/compat.bpf.h added. v4: - Move example scheduler to its own patch. v3: - Make scx_example_central switch all tasks by default. - Convert to BPF inline iterators. v2: - Julia Lawall reported that scx_example_central can overflow the dispatch buffer and malfunction. As scheduling for other CPUs can't be handled by the automatic retry mechanism, fix by implementing an explicit overflow and retry handling. - Updated to use generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18tools/sched_ext: Add scx_show_state.pyTejun Heo
There are states which are interesting but don't quite fit the interface exposed under /sys/kernel/sched_ext. Add tools/scx_show_state.py to show them. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18sched_ext: Print debug dump after an error exitTejun Heo
If a BPF scheduler triggers an error, the scheduler is aborted and the system is reverted to the built-in scheduler. In the process, a lot of information which may be useful for figuring out what happened can be lost. This patch adds debug dump which captures information which may be useful for debugging including runqueue and runnable thread states at the time of failure. The following shows a debug dump after triggering the watchdog: root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100 stats : enq=1 dsp=0 delta=1 deq=0 stats : enq=90 dsp=90 delta=0 deq=0 stats : enq=156 dsp=156 delta=0 deq=0 stats : enq=218 dsp=218 delta=0 deq=0 stats : enq=255 dsp=255 delta=0 deq=0 stats : enq=271 dsp=271 delta=0 deq=0 stats : enq=284 dsp=284 delta=0 deq=0 stats : enq=293 dsp=293 delta=0 deq=0 DEBUG DUMP ================================================================================ kworker/u32:12[320] triggered exit kind 1026: runnable task stall (stress[1530] failed to run for 6.841s) Backtrace: scx_watchdog_workfn+0x136/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 QMAP FIFO[0]: QMAP FIFO[1]: QMAP FIFO[2]: 1436 QMAP FIFO[3]: QMAP FIFO[4]: CPU states ---------- CPU 0 : nr_run=1 ops_qseq=244 curr=swapper/0[0] class=idle_sched_class QMAP: dsp_idx=1 dsp_cnt=0 R stress[1530] -6841ms scx_state/flags=3/0x1 ops_state/qseq=2/20 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=0 asm_sysvec_apic_timer_interrupt+0x16/0x20 CPU 2 : nr_run=2 ops_qseq=142 curr=swapper/2[0] class=idle_sched_class QMAP: dsp_idx=1 dsp_cnt=0 R sshd[1703] -5905ms scx_state/flags=3/0x9 ops_state/qseq=2/88 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=1 __x64_sys_ppoll+0xf6/0x120 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e R fish[1539] -4141ms scx_state/flags=3/0x9 ops_state/qseq=2/124 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=1 futex_wait+0x60/0xe0 do_futex+0x109/0x180 __x64_sys_futex+0x117/0x190 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e CPU 3 : nr_run=2 ops_qseq=162 curr=kworker/u32:12[320] class=ext_sched_class QMAP: dsp_idx=1 dsp_cnt=0 *R kworker/u32:12[320] +0ms scx_state/flags=3/0xd ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=0 scx_dump_state+0x613/0x6f0 scx_ops_error_irq_workfn+0x1f/0x40 irq_work_run_list+0x82/0xd0 irq_work_run+0x14/0x30 __sysvec_irq_work+0x40/0x140 sysvec_irq_work+0x60/0x70 asm_sysvec_irq_work+0x16/0x20 scx_watchdog_workfn+0x15f/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 R kworker/3:2[1436] +0ms scx_state/flags=3/0x9 ops_state/qseq=2/160 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=08 QMAP: force_local=0 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 CPU 7 : nr_run=0 ops_qseq=76 curr=swapper/7[0] class=idle_sched_class ================================================================================ EXIT: runnable task stall (stress[1530] failed to run for 6.841s) It shows that CPU 3 was running the watchdog when it triggered the error condition and the scx_qmap thread has been queued on CPU 0 for over 5 seconds but failed to run. It also prints out scx_qmap specific information - e.g. which tasks are queued on each FIFO and so on using the dump_*() ops. This dump has proved pretty useful for developing and debugging BPF schedulers. Debug dump is generated automatically when the BPF scheduler exits due to an error. The debug buffer used in such cases is determined by sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns the available buffer, the output is truncated and marked accordingly. Debug dump output can also be read through the sched_ext_dump tracepoint. When read through the tracepoint, there is no length limit. SysRq-D can be used to trigger debug dump at any time while a BPF scheduler is loaded. This is non-destructive - the scheduler keeps running afterwards. The output can be read through the sched_ext_dump tracepoint. v2: - The size of exit debug dump buffer can now be customized using sched_ext_ops.exit_dump_len. - sched_ext_ops.dump*() added to enable dumping of BPF scheduler specific information. - Tracpoint output and SysRq-D triggering added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18sched_ext: Allow BPF schedulers to disallow specific tasks from joining ↵Tejun Heo
SCHED_EXT BPF schedulers might not want to schedule certain tasks - e.g. kernel threads. This patch adds p->scx.disallow which can be set by BPF schedulers in such cases. The field can be changed anytime and setting it in ops.prep_enable() guarantees that the task can never be scheduled by sched_ext. scx_qmap is updated with the -d option to disallow a specific PID: # echo $$ 1092 # grep -E '(policy)|(ext\.enabled)' /proc/self/sched policy : 0 ext.enabled : 0 # ./set-scx 1092 # grep -E '(policy)|(ext\.enabled)' /proc/self/sched policy : 7 ext.enabled : 0 Run "scx_qmap -p -d 1092" in another terminal. # cat /sys/kernel/sched_ext/nr_rejected 1 # grep -E '(policy)|(ext\.enabled)' /proc/self/sched policy : 0 ext.enabled : 0 # ./set-scx 1092 setparam failed for 1092 (Permission denied) - v4: Refreshed on top of tip:sched/core. - v3: Update description to reflect /sys/kernel/sched_ext interface change. - v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to accommodate 32bit archs. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Barret Rhoden <brho@google.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18sched_ext: Implement runnable task stall watchdogDavid Vernet
The most common and critical way that a BPF scheduler can misbehave is by failing to run runnable tasks for too long. This patch implements a watchdog. * All tasks record when they become runnable. * A watchdog work periodically scans all runnable tasks. If any task has stayed runnable for too long, the BPF scheduler is aborted. * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the BPF scheduler is aborted. Because the watchdog only scans the tasks which are currently runnable and usually very infrequently, the overhead should be negligible. scx_qmap is updated so that it can be told to stall user and/or kernel tasks. A detected task stall looks like the following: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s) scx_check_timeout_workfn+0x10e/0x1b0 process_one_work+0x287/0x560 worker_thread+0x234/0x420 kthread+0xe9/0x100 ret_from_fork+0x1f/0x30 A detected watchdog stall: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (watchdog failed to check in for 5.001s) scheduler_tick+0x2eb/0x340 update_process_times+0x7a/0x90 tick_sched_timer+0xd8/0x130 __hrtimer_run_queues+0x178/0x3b0 hrtimer_interrupt+0xfc/0x390 __sysvec_apic_timer_interrupt+0xb7/0x2b0 sysvec_apic_timer_interrupt+0x90/0xb0 asm_sysvec_apic_timer_interrupt+0x1b/0x20 default_idle+0x14/0x20 arch_cpu_idle+0xf/0x20 default_idle_call+0x50/0x90 do_idle+0xe8/0x240 cpu_startup_entry+0x1d/0x20 kernel_init+0x0/0x190 start_kernel+0x0/0x392 start_kernel+0x324/0x392 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x104/0x109 secondary_startup_64_no_verify+0xce/0xdb Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to inline scx_notify_sched_tick(). v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was being called before forward progress was guaranteed and thus could lead to system lockup. Relocated. - While enabling, it was comparing msecs against jiffies without conversion leading to spurious load failures on lower HZ kernels. Fixed. - runnable list management is now used by core bypass logic and moved to the patch implementing sched_ext core. v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without conversion leading to spurious load failures in lower HZ kernels. Fixed. v2: - Julia Lawall noticed that the watchdog code was mixing msecs and jiffies. Fix by using jiffies for everything. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Julia Lawall <julia.lawall@inria.fr>
2024-06-18sched_ext: Add scx_simple and scx_example_qmap example schedulersTejun Heo
Add two simple example BPF schedulers - simple and qmap. * simple: In terms of scheduling, it behaves identical to not having any operation implemented at all. The two operations it implements are only to improve visibility and exit handling. On certain homogeneous configurations, this actually can perform pretty well. * qmap: A fixed five level priority scheduler to demonstrate queueing PIDs on BPF maps for scheduling. While not very practical, this is useful as a simple example and will be used to demonstrate different features. v7: - Compat helpers stripped out in prepartion of upstreaming as the upstreamed patchset will be the baselinfe. Utility macros that can be used to implement compat features are kept. - Explicitly disable map autoattach on struct_ops to avoid trying to attach twice while maintaining compatbility with older libbpf. v6: - Common header files reorganized and cleaned up. Compat helpers are added to demonstrate how schedulers can maintain backward compatibility with older kernels while making use of newly added features. - simple_select_cpu() added to keep track of the number of local dispatches. This is needed because the default ops.select_cpu() implementation is updated to dispatch directly and won't call ops.enqueue(). - Updated to reflect the sched_ext API changes. Switching all tasks is the default behavior now and scx_qmap supports partial switching when `-p` is specified. - tools/sched_ext/Kconfig dropped. This will be included in the doc instead. v5: - Improve Makefile. Build artifects are now collected into a separate dir which change be changed. Install and help targets are added and clean actually cleans everything. - MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR() and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss. - Add scx_common.h which provides common utilities to user code such as SCX_BUG[_ON]() and RESIZE_ARRAY(). - Use SCX_BUG[_ON]() to simplify error handling. v4: - Dropped _example prefix from scheduler names. v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit to ease later additions. Comment updates. - Added declarations for BPF inline iterators. In the future, hopefully, these will be consolidated into a generic BPF header so that they don't need to be replicated here. v2: - Updated with the generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18cpupower: Improve cpupower build process descriptionRoman Storozhenko
Enhance cpupower build process description with the information on building and installing the utility to the user defined directories as well as with the information on the way of running the utility from the custom defined installation directory. Signed-off-by: Roman Storozhenko <romeusmeister@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-06-18cpupower: Add 'help' target to the main MakefileRoman Storozhenko
Make "cpupower" building process more user friendly by adding 'help' target to the main makefile. This target describes various build and cleaning options available to the user. Signed-off-by: Roman Storozhenko <romeusmeister@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-06-18cpupower: Replace a dead reference link with working onesRoman Storozhenko
Replace a dead reference link to a turbo boost technology description with a reference to a root page of the technology on the Intel site, and add another one, describing power management technology, which includes short description of the turbo boost. Signed-off-by: Roman Storozhenko <romeusmeister@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-06-18KVM: selftests: Test vCPU boot IDs above 2^32 and MAX_VCPU_IDMathias Krause
The KVM_SET_BOOT_CPU_ID ioctl missed to reject invalid vCPU IDs. Verify this no longer works and gets rejected with an appropriate error code. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-6-minipli@grsecurity.net [sean: add test for MAX_VCPU_ID+1, always do negative test] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: selftests: Test max vCPU IDs corner casesMathias Krause
The KVM_CREATE_VCPU ioctl ABI had an implicit integer truncation bug, allowing 2^32 aliases for a vCPU ID by setting the upper 32 bits of a 64 bit ioctl() argument. It also allowed excluding a once set boot CPU ID. Verify this no longer works and gets rejected with an error. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-5-minipli@grsecurity.net [sean: tweak assert message+comment for 63:32!=0 testcase] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18kselftest/alsa: Fix validation of writes to volatile controlsMark Brown
When validating writes to controls we check that whatever value we wrote actually appears in the control. For volatile controls we cannot assume that this will be the case, the value may be changed at any time including between our write and read. Handle this by moving the check for volatile controls that we currently do for events to a separate block and just verifying that whatever value we read is valid for the control. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20240614-alsa-selftest-volatile-v1-1-3874f02964b1@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://lore.kernel.org/20240616073454.16512-5-tiwai@suse.de
2024-06-18selftests: livepatch: Test atomic replace against multiple modulesMarcos Paulo de Souza
Adapt the current test-livepatch.sh script to account the number of applied livepatches and ensure that an atomic replace livepatch disables all previously applied livepatches. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/20240603-lp-atomic-replace-v3-1-9f3b8ace5c9f@suse.com [mbenes@suse.cz: Fixed typo in a comment.] Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-06-17selftests: mptcp: userspace_pm: fixed subtest namesMatthieu Baerts (NGI0)
It is important to have fixed (sub)test names in TAP, because these names are used to identify them. If they are not fixed, tracking cannot be done. Some subtests from the userspace_pm selftest were using random numbers in their names: the client and server address IDs from $RANDOM, and the client port number randomly picked by the kernel when creating the connection. These values have been replaced by 'client' and 'server' words: that's even more helpful than showing random numbers. Note that the addresses IDs are incremented and decremented in the test: +1 or -1 are then displayed in these cases. Not to loose info that can be useful for debugging in case of issues, these random numbers are now displayed at the beginning of the test. Fixes: f589234e1af0 ("selftests: mptcp: userspace_pm: format subtests results in TAP") Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240614-upstream-net-20240614-selftests-mptcp-uspace-pm-fixed-test-names-v1-1-460ad3edb429@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17testing: nvdimm: Add MODULE_DESCRIPTION() macrosIra Weiny
When building with W=1 the following errors are seen: WARNING: modpost: missing MODULE_DESCRIPTION() in tools/testing/nvdimm/test/nfit_test.o WARNING: modpost: missing MODULE_DESCRIPTION() in tools/testing/nvdimm/test/ndtest.o Add the required MODULE_DESCRIPTION() to the test platform device drivers. Suggested-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://patch.msgid.link/r/20240611-nvdimm-test-mod-warn-v1-1-4a583be68c17@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com>
2024-06-17testing: nvdimm: iomap: add MODULE_DESCRIPTION()Jeff Johnson
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/../../tools/testing/nvdimm/test/iomap.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://patch.msgid.link/r/20240526-md-testing-nvdimm-v1-1-f8b617bb28e1@quicinc.com Signed-off-by: Ira Weiny <ira.weiny@intel.com>
2024-06-17resolve_btfids: Handle presence of .BTF.base sectionAlan Maguire
Now that btf_parse_elf() handles .BTF.base section presence, we need to ensure that resolve_btfids uses .BTF.base when present rather than the vmlinux base BTF passed in via the -B option. Detect .BTF.base section presence and unset the base BTF path to ensure that BTF ELF parsing will do the right thing. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-7-alan.maguire@oracle.com
2024-06-17libbpf: Make btf_parse_elf process .BTF.base transparentlyEduard Zingerman
Update btf_parse_elf() to check if .BTF.base section is present. The logic is as follows: if .BTF.base section exists: distilled_base := btf_new(.BTF.base) if distilled_base: btf := btf_new(.BTF, .base_btf=distilled_base) if base_btf: btf_relocate(btf, base_btf) else: btf := btf_new(.BTF) return btf In other words: - if .BTF.base section exists, load BTF from it and use it as a base for .BTF load; - if base_btf is specified and .BTF.base section exist, relocate newly loaded .BTF against base_btf. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com
2024-06-17selftests/bpf: Extend distilled BTF tests to cover BTF relocationAlan Maguire
Ensure relocated BTF looks as expected; in this case identical to original split BTF, with a few duplicate anonymous types added to split BTF by the relocation process. Also add relocation tests for edge cases like missing type in base BTF and multiple types of the same name. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-5-alan.maguire@oracle.com
2024-06-17libbpf: Split BTF relocationAlan Maguire
Map distilled base BTF type ids referenced in split BTF and their references to the base BTF passed in, and if the mapping succeeds, reparent the split BTF to the base BTF. Relocation is done by first verifying that distilled base BTF only consists of named INT, FLOAT, ENUM, FWD, STRUCT and UNION kinds; then we sort these to speed lookups. Once sorted, the base BTF is iterated, and for each relevant kind we check for an equivalent in distilled base BTF. When found, the mapping from distilled -> base BTF id and string offset is recorded. In establishing mappings, we need to ensure we check STRUCT/UNION size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION, and when duplicate names exist for the same STRUCT/UNION. Otherwise size is ignored in matching STRUCT/UNIONs. Once all mappings are established, we can update type ids and string offsets in split BTF and reparent it to the new base. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com
2024-06-17selftests/bpf: Test distilled base, split BTF generationAlan Maguire
Test generation of split+distilled base BTF, ensuring that - named base BTF STRUCTs and UNIONs are represented as 0-vlen sized STRUCT/UNIONs - named ENUM[64]s are represented as 0-vlen named ENUM[64]s - anonymous struct/unions are represented in full in split BTF - anonymous enums are represented in full in split BTF - types unreferenced from split BTF are not present in distilled base BTF Also test that with vmlinux BTF and split BTF based upon it, we only represent needed base types referenced from split BTF in distilled base. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-3-alan.maguire@oracle.com
2024-06-17libbpf: Add btf__distill_base() creating split BTF with distilled base BTFAlan Maguire
To support more robust split BTF, adding supplemental context for the base BTF type ids that split BTF refers to is required. Without such references, a simple shuffling of base BTF type ids (without any other significant change) invalidates the split BTF. Here the attempt is made to store additional context to make split BTF more robust. This context comes in the form of distilled base BTF providing minimal information (name and - in some cases - size) for base INTs, FLOATs, STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that points at that base and contains any additional types needed (such as TYPEDEF, PTR and anonymous STRUCT/UNION declarations). This information constitutes the minimal BTF representation needed to disambiguate or remove split BTF references to base BTF. The rules are as follows: - INT, FLOAT, FWD are recorded in full. - if a named base BTF STRUCT or UNION is referred to from split BTF, it will be encoded as a zero-member sized STRUCT/UNION (preserving size for later relocation checks). Only base BTF STRUCT/UNIONs that are either embedded in split BTF STRUCT/UNIONs or that have multiple STRUCT/UNION instances of the same name will _need_ size checks at relocation time, but as it is possible a different set of types will be duplicates in the later to-be-resolved base BTF, we preserve size information for all named STRUCT/UNIONs. - if an ENUM[64] is named, a ENUM forward representation (an ENUM with no values) of the same size is used. - in all other cases, the type is added to the new split BTF. Avoiding struct/union/enum/enum64 expansion is important to keep the distilled base BTF representation to a minimum size. When successful, new representations of the distilled base BTF and new split BTF that refers to it are returned. Both need to be freed by the caller. So to take a simple example, with split BTF with a type referring to "struct sk_buff", we will generate distilled base BTF with a 0-member STRUCT sk_buff of the appropriate size, and the split BTF will refer to it instead. Tools like pahole can utilize such split BTF to populate the .BTF section (split BTF) and an additional .BTF.base section. Then when the split BTF is loaded, the distilled base BTF can be used to relocate split BTF to reference the current (and possibly changed) base BTF. So for example if "struct sk_buff" was id 502 when the split BTF was originally generated, we can use the distilled base BTF to see that id 502 refers to a "struct sk_buff" and replace instances of id 502 with the current (relocated) base BTF sk_buff type id. Distilled base BTF is small; when building a kernel with all modules using distilled base BTF as a test, overall module size grew by only 5.3Mb total across ~2700 modules. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com
2024-06-17Merge tag 'mm-hotfixes-stable-2024-06-17-11-43' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Mainly MM singleton fixes. And a couple of ocfs2 regression fixes" * tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kcov: don't lose track of remote references during softirqs mm: shmem: fix getting incorrect lruvec when replacing a shmem folio mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick mm: fix possible OOB in numa_rebuild_large_mapping() mm/migrate: fix kernel BUG at mm/compaction.c:2761! selftests: mm: make map_fixed_noreplace test names stable mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC mm: mmap: allow for the maximum number of bits for randomizing mmap_base by default gcov: add support for GCC 14 zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING mm: huge_memory: fix misused mapping_large_folio_support() for anon folios lib/alloc_tag: fix RCU imbalance in pgalloc_tag_get() lib/alloc_tag: do not register sysctl interface when CONFIG_SYSCTL=n MAINTAINERS: remove Lorenzo as vmalloc reviewer Revert "mm: init_mlocked_on_free_v3" mm/page_table_check: fix crash on ZONE_DEVICE gcc: disable '-Warray-bounds' for gcc-9 ocfs2: fix NULL pointer dereference in ocfs2_abort_trigger() ocfs2: fix NULL pointer dereference in ocfs2_journal_dirty()
2024-06-17lkdtm/bugs: add test for hung smp_call_function_single()Mark Rutland
The CONFIG_CSD_LOCK_WAIT_DEBUG option enables debugging of hung smp_call_function*() calls (e.g. when the target CPU gets stuck within the callback function). Testing this option requires triggering such hangs. This patch adds an lkdtm test with a hung smp_call_function_single() callback, which can be used to test CONFIG_CSD_LOCK_WAIT_DEBUG and NMI backtraces (as CONFIG_CSD_LOCK_WAIT_DEBUG will attempt an NMI backtrace of the hung target CPU). On arm64 using pseudo-NMI, this looks like: | # mount -t debugfs none /sys/kernel/debug/ | # echo SMP_CALL_LOCKUP > /sys/kernel/debug/provoke-crash/DIRECT | lkdtm: Performing direct entry SMP_CALL_LOCKUP | smp: csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000176 ns for CPU#00 __lkdtm_SMP_CALL_LOCKUP+0x0/0x8(0x0). | smp: csd: CSD lock (#1) handling this request. | Sending NMI from CPU 1 to CPUs 0: | NMI backtrace for cpu 0 | CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.9.0-rc4-00001-gfdfd281212ec #1 | Hardware name: linux,dummy-virt (DT) | pstate: 60401005 (nZCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--) | pc : __lkdtm_SMP_CALL_LOCKUP+0x0/0x8 | lr : __flush_smp_call_function_queue+0x1b0/0x290 | sp : ffff800080003f30 | pmr_save: 00000060 | x29: ffff800080003f30 x28: ffffa4ce961a4900 x27: 0000000000000000 | x26: fff000003fcfa0c0 x25: ffffa4ce961a4900 x24: ffffa4ce959aa140 | x23: ffffa4ce959aa140 x22: 0000000000000000 x21: ffff800080523c40 | x20: 0000000000000000 x19: 0000000000000000 x18: fff05b31aa323000 | x17: fff05b31aa323000 x16: ffff800080000000 x15: 0000330fc3fe6b2c | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000279 | x11: 0000000000000040 x10: fff000000302d0a8 x9 : fff000000302d0a0 | x8 : fff0000003400270 x7 : 0000000000000000 x6 : ffffa4ce9451b810 | x5 : 0000000000000000 x4 : fff05b31aa323000 x3 : ffff800080003f30 | x2 : fff05b31aa323000 x1 : ffffa4ce959aa140 x0 : 0000000000000000 | Call trace: | __lkdtm_SMP_CALL_LOCKUP+0x0/0x8 | generic_smp_call_function_single_interrupt+0x14/0x20 | ipi_handler+0xb8/0x178 | handle_percpu_devid_irq+0x84/0x130 | generic_handle_domain_irq+0x2c/0x44 | gic_handle_irq+0x118/0x240 | call_on_irq_stack+0x24/0x4c | do_interrupt_handler+0x80/0x84 | el1_interrupt+0x44/0xc0 | el1h_64_irq_handler+0x18/0x24 | el1h_64_irq+0x78/0x7c | default_idle_call+0x40/0x60 | do_idle+0x23c/0x2d0 | cpu_startup_entry+0x38/0x3c | kernel_init+0x0/0x1d8 | start_kernel+0x51c/0x608 | __primary_switched+0x80/0x88 | CPU: 1 PID: 128 Comm: sh Not tainted 6.9.0-rc4-00001-gfdfd281212ec #1 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x90/0xe8 | show_stack+0x18/0x24 | dump_stack_lvl+0xac/0xe8 | dump_stack+0x18/0x24 | csd_lock_wait_toolong+0x268/0x338 | smp_call_function_single+0x1dc/0x2f0 | lkdtm_SMP_CALL_LOCKUP+0xcc/0xfc | lkdtm_do_action+0x1c/0x38 | direct_entry+0xbc/0x14c | full_proxy_write+0x60/0xb4 | vfs_write+0xd0/0x35c | ksys_write+0x70/0x104 | __arm64_sys_write+0x1c/0x28 | invoke_syscall+0x48/0x114 | el0_svc_common.constprop.0+0x40/0xe0 | do_el0_svc+0x1c/0x28 | el0_svc+0x38/0x108 | el0t_64_sync_handler+0x120/0x12c | el0t_64_sync+0x1a4/0x1a8 | smp: csd: Continued non-responsive CSD lock (#1) on CPU#1, waiting 10000064272 ns for CPU#00 __lkdtm_SMP_CALL_LOCKUP+0x0/0x8(0x0). | smp: csd: CSD lock (#1) handling this request. | smp: csd: Continued non-responsive CSD lock (#1) on CPU#1, waiting 15000064384 ns for CPU#00 __lkdtm_SMP_CALL_LOCKUP+0x0/0x8(0x0). | smp: csd: CSD lock (#1) handling this request. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240515120828.375585-1-mark.rutland@arm.com Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-17Merge tag 'hyperv-fixes-signed-20240616' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull Hyper-V fixes from Wei Liu: - Some cosmetic changes for hv.c and balloon.c (Aditya Nagesh) - Two documentation updates (Michael Kelley) - Suppress the invalid warning for packed member alignment (Saurabh Sengar) - Two hv_balloon fixes (Michael Kelley) * tag 'hyperv-fixes-signed-20240616' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Drivers: hv: Cosmetic changes for hv.c and balloon.c Documentation: hyperv: Improve synic and interrupt handling description Documentation: hyperv: Update spelling and fix typo tools: hv: suppress the invalid warning for packed member alignment hv_balloon: Enable hot-add for memblock sizes > 128 MiB hv_balloon: Use kernel macros to simplify open coded sequences
2024-06-17selftests/bpf: Add a few tests to coverYonghong Song
Add three unit tests in verifier_movsx.c to cover cases where missed var_off setting can cause unexpected verification success or failure. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240615174637.3995589-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-15perf hist: Honor symbol_conf.skip_emptyNamhyung Kim
So that it can skip events with no sample according to the config value. This can omit the dummy event in the output of perf report --group. An example output: $ sudo perf mem record -a sleep 1 $ sudo perf report --group Before) # # Samples: 232 of events 'cpu/mem-loads,ldlat=30/P, cpu/mem-stores/P, dummy:u' # Event count (approx.): 3089861 # # Overhead Command Shared Object Symbol # ........................ ........... ................. ..................................... # 9.29% 0.00% 0.00% swapper [kernel.kallsyms] [k] update_blocked_averages 5.26% 0.15% 0.00% swapper [kernel.kallsyms] [k] __update_load_avg_se 4.15% 0.00% 0.00% perf-exec [kernel.kallsyms] [k] slab_update_freelist.isra.0 3.87% 0.00% 0.00% perf-exec [kernel.kallsyms] [k] memcg_slab_post_alloc_hook 3.79% 0.17% 0.00% swapper [kernel.kallsyms] [k] enqueue_task_fair 3.63% 0.00% 0.00% sleep [kernel.kallsyms] [k] next_uptodate_page 2.86% 0.00% 0.00% swapper [kernel.kallsyms] [k] __update_load_avg_cfs_rq 2.78% 0.00% 0.00% swapper [kernel.kallsyms] [k] __schedule 2.34% 0.00% 0.00% swapper [kernel.kallsyms] [k] intel_idle 2.32% 0.97% 0.00% swapper [kernel.kallsyms] [k] psi_group_change After) # # Samples: 232 of events 'cpu/mem-loads,ldlat=30/P, cpu/mem-stores/P' # Event count (approx.): 3089861 # # Overhead Command Shared Object Symbol # ................ ........... ................. ..................................... # 9.29% 0.00% swapper [kernel.kallsyms] [k] update_blocked_averages 5.26% 0.15% swapper [kernel.kallsyms] [k] __update_load_avg_se 4.15% 0.00% perf-exec [kernel.kallsyms] [k] slab_update_freelist.isra.0 3.87% 0.00% perf-exec [kernel.kallsyms] [k] memcg_slab_post_alloc_hook 3.79% 0.17% swapper [kernel.kallsyms] [k] enqueue_task_fair 3.63% 0.00% sleep [kernel.kallsyms] [k] next_uptodate_page 2.86% 0.00% swapper [kernel.kallsyms] [k] __update_load_avg_cfs_rq 2.78% 0.00% swapper [kernel.kallsyms] [k] __schedule 2.34% 0.00% swapper [kernel.kallsyms] [k] intel_idle 2.32% 0.97% swapper [kernel.kallsyms] [k] psi_group_change Now it doesn't have a column for the dummy event. Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240607202918.2357459-5-namhyung@kernel.org
2024-06-15perf hist: Add symbol_conf.skip_emptyNamhyung Kim
Add the skip_empty flag to symbol_conf and set the value from the report command to preserve the existing behavior. This makes the code simpler and will be needed other code which is hard to add a new argument. Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240607202918.2357459-4-namhyung@kernel.org
2024-06-15perf hist: Simplify __hpp_fmt() using hpp_fmt_dataNamhyung Kim
The struct hpp_fmt_data is to keep the values for each group members so it doesn't need to check the event index in the group. Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240607202918.2357459-3-namhyung@kernel.org
2024-06-15perf hist: Factor out __hpp__fmt_print()Namhyung Kim
Split the logic to print the histogram values according to the format string. This was used in 3 different places so it's better to move out the logic into a function. Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240607202918.2357459-2-namhyung@kernel.org
2024-06-15perf: sched map skips redundant lines with cpu filtersFernand Sieber
perf sched map supports cpu filter. However, even with cpu filters active, any context switch currently corresponds to a separate line. As result, context switches on irrelevant cpus result to redundant lines, which makes the output particlularly difficult to read on wide architectures. Fix it by skipping printing for irrelevant CPUs. Example snippet of output before fix: *B0 1.461147 secs B0 B0 B0 *G0 1.517139 secs After fix: *B0 1.461147 secs *G0 1.517139 secs Signed-off-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Reviewed-and-tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240614073517.94974-1-sieberf@amazon.com
2024-06-15selftests: mm: make map_fixed_noreplace test names stableMark Brown
KTAP parsers interpret the output of ksft_test_result_*() as being the name of the test. The map_fixed_noreplace test uses a dynamically allocated base address for the mmap()s that it tests and currently includes this in the test names that it logs so the test names that are logged are not stable between runs. It also uses multiples of PAGE_SIZE which mean that runs for kernels with different PAGE_SIZE configurations can't be directly compared. Both these factors cause issues for CI systems when interpreting and displaying results. Fix this by replacing the current test names with fixed strings describing the intent of the mappings that are logged, the existing messages with the actual addresses and sizes are retained as diagnostic prints to aid in debugging. Link: https://lkml.kernel.org/r/20240605-kselftest-mm-fixed-noreplace-v1-1-a235db8b9be9@kernel.org Fixes: 4838cf70e539 ("selftests/mm: map_fixed_noreplace: conform test to TAP format output") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-14selftests: forwarding: Add test for minimum and maximum MTUAmit Cohen
Add cases to check minimum and maximum MTU which are exposed via "ip -d link show". Test configuration and traffic. Use VLAN devices as usually VLAN header (4 bytes) is not included in the MTU, and drivers should configure hardware correctly to send maximum MTU payload size in VLAN tagged packets. $ ./min_max_mtu.sh TEST: ping [ OK ] TEST: ping6 [ OK ] TEST: Test maximum MTU configuration [ OK ] TEST: Test traffic, packet size is maximum MTU [ OK ] TEST: Test minimum MTU configuration [ OK ] TEST: Test traffic, packet size is minimum MTU [ OK ] Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/89de8be8989db7a97f3b39e3c9da695673e78d2e.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-06-14 We've added 8 non-merge commits during the last 2 day(s) which contain a total of 9 files changed, 92 insertions(+), 11 deletions(-). The main changes are: 1) Silence a syzkaller splat under CONFIG_DEBUG_NET=y in pskb_pull_reason() triggered via __bpf_try_make_writable(), from Florian Westphal. 2) Fix removal of kfuncs during linking phase which then throws a kernel build warning via resolve_btfids about unresolved symbols, from Tony Ambardar. 3) Fix a UML x86_64 compilation failure from BPF as pcpu_hot symbol is not available on User Mode Linux, from Maciej Żenczykowski. 4) Fix a register corruption in reg_set_min_max triggering an invariant violation in BPF verifier, from Daniel Borkmann. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Harden __bpf_kfunc tag against linker kfunc removal compiler_types.h: Define __retain for __attribute__((__retain__)) bpf: Avoid splat in pskb_pull_reason bpf: fix UML x86_64 compile failure selftests/bpf: Add test coverage for reg_set_min_max handling bpf: Reduce stack consumption in check_stack_write_fixed_off bpf: Fix reg_set_min_max corruption of fake_reg MAINTAINERS: mailmap: Update Stanislav's email address ==================== Link: https://lore.kernel.org/r/20240614203223.26500-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14selftests/bpf: Add tests for add_constAlexei Starovoitov
Improve arena based tests and add several C and asm tests with specific pattern. These tests would have failed without add_const verifier support. Also add several loop_inside_iter*() tests that are not related to add_const, but nice to have. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240613013815.953-5-alexei.starovoitov@gmail.com
2024-06-14bpf: Support can_loop/cond_break on big endianAlexei Starovoitov
Add big endian support for can_loop/cond_break macros. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240613013815.953-4-alexei.starovoitov@gmail.com
2024-06-14bpf: Track delta between "linked" registers.Alexei Starovoitov
Compilers can generate the code r1 = r2 r1 += 0x1 if r2 < 1000 goto ... use knowledge of r2 range in subsequent r1 operations So remember constant delta between r2 and r1 and update r1 after 'if' condition. Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct: for (i = 0; i < 1000 && can_loop; i++) The "undo" pass was introduced in LLVM https://reviews.llvm.org/D121937 to prevent this optimization, but it cannot cover all cases. Instead of fighting middle end optimizer in BPF backend teach the verifier about this pattern. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com