path: root/kernel
2024-09-11 Merge tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux  Linus Torvalds
Pull printk fix from Petr Mladek:

 - Fix build of serial_core as a module

* tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Export match_devname_and_update_preferred_console()
2024-09-11kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansionBaoquan He
make tags always produces the following annoying warnings:

  ctags: Warning: kernel/workqueue.c:470: null expansion of name pattern "\1"
  ctags: Warning: kernel/workqueue.c:474: null expansion of name pattern "\1"
  ctags: Warning: kernel/workqueue.c:478: null expansion of name pattern "\1"

In commit 25528213fe9f ("tags: Fix DEFINE_PER_CPU expansions"), code in several places was adjusted, including the cpu_worker_pools definition. In commit 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets") the cpu_worker_pools definition was unfolded back; it is unclear whether that was intentional or an oversight. Make a change to mute these warnings specifically. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11Merge branches 'pm-sleep', 'pm-opp' and 'pm-tools'Rafael J. Wysocki
Merge updates related to system sleep, operating performance points (OPP), and PM tooling for 6.12-rc1:

 - Remove unused stub for saveable_highmem_page() and remove deprecated macros from power management documentation (Andy Shevchenko).
 - Use sysfs_emit() and sysfs_emit_at() in "show" functions in the PM sysfs interface (Xueqin Luo).
 - Update the maintainers information for the operating-points-v2-ti-cpu DT binding (Dhruva Gole).
 - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring).
 - Update directory handling and installation process in the pm-graph Makefile and add .gitignore to ignore sleepgraph.py artifacts to pm-graph (Amit Vadhavana, Yo-Jung Lin).
 - Make cpupower display residency value in idle-info (Aboorva Devarajan).
 - Add missing powercap_set_enabled() stub function to cpupower (John B. Wyatt IV).
 - Add SWIG support to cpupower (John B. Wyatt IV).

* pm-sleep:
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  Documentation: PM: Discourage use of deprecated macros
  PM: sleep: Use sysfs_emit() and sysfs_emit_at() in "show" functions
  PM: hibernate: Use sysfs_emit() and sysfs_emit_at() in "show" functions

* pm-opp:
  dt-bindings: opp: operating-points-v2-ti-cpu: Update maintainers
  opp: ti: Drop unnecessary of_match_ptr()

* pm-tools:
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  pm-graph: Update directory handling and installation process in Makefile
  pm-graph: Make git ignore sleepgraph.py artifacts
  tools/cpupower: display residency value in idle-info
2024-09-11bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpersAndrii Nakryiko
Add sleepable implementations of the bpf_get_stack() and bpf_get_task_stack() helpers and allow them to be used from sleepable BPF programs (e.g., sleepable uprobes). Note that capturing the stack trace IPs itself is not sleepable (that would need to be a separate project); only the build ID fetching is sleepable and thus more reliable, as it will wait for data to be paged in if necessary. For that we make use of the sleepable build_id_parse() implementation. Now that the build ID related internals in kernel/bpf/stackmap.c can be used both in sleepable and non-sleepable contexts, we need to add additional rcu_read_lock()/rcu_read_unlock() protection around fetching perf_callchain_entry, but with the refactoring in the previous commit it's now pretty straightforward. We make sure to do rcu_read_unlock() (in sleepable mode only) right before the stack_map_get_build_id_offset() call, which can sleep. By that time we don't have any more use for perf_callchain_entry. Note, bpf_get_task_stack() will fail for user mode if task != current, and for kernel mode build IDs are irrelevant, so in that sense adding a sleepable bpf_get_task_stack() implementation is a no-op. It feels right to wire this up for symmetry and completeness, but I'm open to just dropping it until we support the `user && crosstask` condition. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
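For illustration, a minimal sketch of what this enables on the BPF side: a sleepable uprobe capturing a build-ID annotated user stack. The binary path, function name, map layout, and stack depth are assumptions for this example, not part of the patch:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  #define MAX_DEPTH 32

  struct stack_buf {
          struct bpf_stack_build_id ids[MAX_DEPTH];
  };

  /* per-CPU scratch space for the build-ID annotated stack */
  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct stack_buf);
  } scratch SEC(".maps");

  /* the ".s" suffix makes the uprobe sleepable, so build ID fetching
   * may page in file-backed data instead of giving up */
  SEC("uprobe.s//usr/bin/myapp:process_request")
  int BPF_UPROBE(on_request)
  {
          __u32 zero = 0;
          struct stack_buf *buf = bpf_map_lookup_elem(&scratch, &zero);

          if (!buf)
                  return 0;

          /* BPF_F_USER_BUILD_ID fills in build ID + offset per frame */
          bpf_get_stack(ctx, buf->ids, sizeof(buf->ids),
                        BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
          return 0;
  }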
2024-09-11bpf: decouple stack_map_get_build_id_offset() from perf_callchain_entryAndrii Nakryiko
Change stack_map_get_build_id_offset(), which is used to convert stack trace IP addresses into build ID+offset pairs. Right now this function accepts an array of u64s as input and uses an array of struct bpf_stack_build_id as output. This is problematic because the u64 array comes from perf_callchain_entry, which is (non-sleepable) RCU protected, so once we allow sleepable build ID fetching, this all breaks down. But it's actually pretty easy to make stack_map_get_build_id_offset() work with an array of struct bpf_stack_build_id as both input and output, which is what this patch does, eliminating the dependency on perf_callchain_entry. We require the caller to fill out the bpf_stack_build_id.ip fields (all others can be left uninitialized) and update them in place as we do build ID resolution. We make sure to READ_ONCE() and cache locally the current IP value, as we use it in a few places to find the matching VMA and so on. Given this data is directly accessible and modifiable by the user's BPF code, we should make sure to have a consistent view of it. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11lib/buildid: rename build_id_parse() into build_id_parse_nofault()Andrii Nakryiko
Make it clear that build_id_parse() assumes that it can take no page fault by renaming it, along with its current few users, to build_id_parse_nofault(). Also add a build_id_parse() stub which for now falls back to the non-sleepable implementation, but will be changed in subsequent patches to take advantage of sleepable context. The PROCMAP_QUERY ioctl() on /proc/<pid>/maps files uses build_id_parse() and will automatically take advantage of the more reliable sleepable-context implementation. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11bpf: Support __nullable argument suffix for tp_btfPhilo Lu
Pointers passed to tp_btf were trusted to be valid, but some tracepoints do take a NULL pointer as input, such as trace_tcp_send_reset(). The invalid memory access then cannot be detected by the verifier. This patch fixes it by adding a "__nullable" suffix to the unreliable argument. The suffix is shown in BTF, and PTR_MAYBE_NULL will be added to nullable arguments. Users must then check the pointer before using it. A problem here is that we use "btf_trace_##call" to search for the func_proto. As it is a typedef, argument names as well as the suffix are not recorded. To solve this, I use bpf_raw_event_map to find "__bpf_trace_##template" from "btf_trace_##call", and then we can see the suffix. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Link: https://lore.kernel.org/r/20240911033719.91468-2-lulie@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
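A sketch of the consumer side, assuming a tracepoint whose socket argument is declared with the __nullable suffix (the field access is illustrative): the verifier marks the argument PTR_MAYBE_NULL and refuses to load the program unless it is checked:

  SEC("tp_btf/tcp_send_reset")
  int BPF_PROG(on_reset, const struct sock *sk, const struct sk_buff *skb)
  {
          /* sk may be NULL; without this check the verifier now
           * rejects the program instead of trusting the pointer */
          if (!sk)
                  return 0;

          bpf_printk("reset, sk state %d", sk->__sk_common.skc_state);
          return 0;
  }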
2024-09-11bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcvDaniel Xu
cpumap takes RX processing out of softirq and onto a separate kthread. Since the kthread needs to be scheduled in order to run (versus softirq which does not), we can theoretically experience extra latency if the system is under load and the scheduler is being unfair to us. Moving the tracepoint to before passing the skb list up the stack allows users to more accurately measure enqueue/dequeue latency introduced by cpumap via xdp:xdp_cpumap_enqueue and xdp:xdp_cpumap_kthread tracepoints. f9419f7bd7a5 ("bpf: cpumap add tracepoints") which added the tracepoints states that the intent behind them was for general observability and for a feedback loop to see if the queues are being overwhelmed. This change does not mess with either of those use cases but rather adds a third one. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/bpf/47615d5b5e302e4bd30220473779e98b492d47cd.1725585718.git.dxu@dxuuu.xyz
2024-09-11sched/cpufreq: Use NSEC_PER_MSEC for deadline taskChristian Loehle
Convert the sugov deadline task attributes to use the available definitions to make them more readable. No functional change. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20240813144348.1180344-5-christian.loehle@arm.com
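The conversion has roughly this shape (a sketch of the sugov kthread's SCHED_DEADLINE attributes; the field list is abbreviated):

  	struct sched_attr attr = {
  		.sched_policy	= SCHED_DEADLINE,
  		.sched_flags	= SCHED_FLAG_SUGOV,
  -		.sched_runtime	=  1000000,	/* raw nanoseconds */
  -		.sched_deadline	= 10000000,
  -		.sched_period	= 10000000,
  +		.sched_runtime	= NSEC_PER_MSEC,
  +		.sched_deadline	= 10 * NSEC_PER_MSEC,
  +		.sched_period	= 10 * NSEC_PER_MSEC,
  	};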
2024-09-11Merge branch 'for-6.11-fixup' into for-linusPetr Mladek
2024-09-11Merge v6.11-rc7 into drm-nextSimona Vetter
Thomas needs 5a498d4d06d6 ("drm/fbdev-dma: Only install deferred I/O if necessary") in drm-misc, so start the backmerge cascade. Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
2024-09-10sched_ext: Don't trigger ops.quiescent/runnable() on migrationsTejun Heo
A task moving across CPUs should not trigger quiescent/runnable task state events as the task is staying runnable the whole time and just stopping and then starting on different CPUs. Suppress quiescent/runnable task state events if task_on_rq_migrating(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10sched_ext: Synchronize bypass state changes with rq lockTejun Heo
While the BPF scheduler is being unloaded, the following warning messages trigger sometimes:

  NOHZ tick-stop error: local softirq work is pending, handler #80!!!

This is caused by the CPU entering idle while there are pending softirqs. The main culprit is the bypassing state assertion not being synchronized with rq operations. As the BPF scheduler cannot be trusted in the disable path, the first step is entering the bypass mode where the BPF scheduler is ignored and scheduling becomes global FIFO. This is implemented by turning scx_ops_bypassing() true. However, the transition isn't synchronized against anything and it's possible for enqueue and dispatch paths to have different ideas on whether bypass mode is on. Make each rq track its own bypass state with SCX_RQ_BYPASSING, which is modified while the rq is locked. This removes most of the NOHZ tick-stop messages but not completely. I believe the stragglers are from the sched core bug where pick_task_scx() can be called without a preceding balance_scx(). Once that bug is fixed, we should verify that all occurrences of this error message are gone too. v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that the per-cpu states are always synchronized with the global state. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Vernet <void@manifault.com>
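A minimal sketch of the per-rq test this introduces, assuming the flag lives in rq->scx.flags as described (only meaningful while the rq lock is held):

  static bool scx_rq_bypassing(struct rq *rq)
  {
          return unlikely(rq->scx.flags & SCX_RQ_BYPASSING);
  }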
2024-09-10cgroup: Do not report unavailable v1 controllers in /proc/cgroupsMichal Koutný
This is a followup to CONFIG-urability of cpuset and memory controllers for v1 hierarchies. Make the output in /proc/cgroups reflect that !CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and !CONFIG_MEMCG_V1 is like !CONFIG_MEMCG. The intended effect is that hiding the unavailable controllers will hint users not to try mounting them on v1. Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10cgroup: Disallow mounting v1 hierarchies without controller implementationMichal Koutný
The configs that disable some v1 controllers would still allow mounting them but with no controller-specific files. (Making such hierarchies equivalent to named v1 hierarchies.) To achieve behavior consistent with actual out-compilation of a whole controller, the mounts should treat respective controllers as non-existent. Wrap implementation into a helper function, leverage legacy_files to detect compiled out controllers. The effect is that mounts on v1 would fail and produce a message like: [ 1543.999081] cgroup: Unknown subsys name 'memory' Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10cgroup/cpuset: Expose cpuset filesystem with cpuset v1 onlyMichal Koutný
The cpuset filesystem is a legacy interface to the cpuset controller with (pre-)v1 features. It makes little sense to co-mount it on systems without cpuset v1, so do not build it when cpuset v1 is not built either. Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10PM: hibernate: Remove unused stub for saveable_highmem_page()Andy Shevchenko
When saveable_highmem_page() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

  kernel/power/snapshot.c:1369:21: error: unused function 'saveable_highmem_page' [-Werror,-Wunused-function]
   1369 | static inline void *saveable_highmem_page(struct zone *z, unsigned long p)
        |                     ^~~~~~~~~~~~~~~~~~~~~

Fix this by removing the unused stub. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20240905184848.318978-1-andriy.shevchenko@linux.intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-10 Merge tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  Linus Torvalds
Pull tracing fixes from Steven Rostedt:

 - Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER. The fix to some locking races moved the declaration of the interface_lock up in the file, but also moved it into the CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when that wasn't set. Move it further up and out of that #ifdef block.

 - Remove unused function run_tracer_selftest() stub. When CONFIG_FTRACE_STARTUP_TEST is not set the stub function run_tracer_selftest() is not used and clang is warning about it. Remove the function stub as it is not needed.

* tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drop unused helper function to fix the build
  tracing/osnoise: Fix build when timerlat is not enabled
2024-09-10ntp: Make sure RTC is synchronized when time goes backwardsBenjamin ROBIN
sync_hw_clock() is normally called every 11 minutes when time is synchronized. The issue is that this periodic timer uses the REALTIME clock, so when time moves backwards (the NTP server jumps into the past), the timer expires late. If the timer expires late, which can be days later, the RTC will no longer be updated, which is an issue if the device is abruptly powered OFF during this period. When the device restarts (when powered ON), it will have the date prior to the ADJ_SETOFFSET call. A normal NTP server should not jump into the past like that, but it is possible... Another way of reproducing this issue is to use phc2sys to synchronize the REALTIME clock with, for example, an IRIG timecode with a source always starting at the same date (not synchronized). Also, if the time jumps into the future by less than 11 minutes, the RTC may not be updated immediately (minor issue). Consider the following scenario:

 - Time is synchronized, and sync_hw_clock() was just called (the timer expires in 11 minutes).
 - The time jumps forward by a couple of minutes.
 - The time is synchronized again.
 - Users may expect the RTC to be updated as soon as possible, not after 11 minutes (for the same reason, if a power loss occurs in this period).

Cancel the periodic timer on any time jump (ADJ_SETOFFSET) greater than or equal to 1s. The timer will be relaunched at the end of do_adjtimex() if NTP is still considered synced. Otherwise the timer will be relaunched later when NTP is synced. This way, when the time is synchronized again, the RTC is updated after less than 2 seconds. Signed-off-by: Benjamin ROBIN <dev@benjarobin.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240908140836.203911-1-dev@benjarobin.fr
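A sketch of the idea in do_adjtimex(); the two sync_hw_clock_*() helper names below are hypothetical stand-ins for the patch's actual identifiers:

  /* on an ADJ_SETOFFSET jump of >= 1s, stop the periodic RTC sync
   * timer (sync_hw_clock_cancel() is a hypothetical helper) */
  if ((txc->modes & ADJ_SETOFFSET) &&
      (delta.tv_sec <= -1 || delta.tv_sec >= 1))
          sync_hw_clock_cancel();

  /* at the end of do_adjtimex(): rearm immediately if still synced,
   * so the RTC is updated within seconds instead of after 11 minutes */
  if (ntp_synced())
          sync_hw_clock_rearm();  /* hypothetical helper */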
2024-09-10Merge branch 'linus' into timers/coreThomas Gleixner
To update with the latest fixes.
2024-09-10 locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS  Waiman Long
Both is_rwsem_reader_owned() and rwsem_owner() are currently only used when CONFIG_DEBUG_RWSEMS is defined. This causes a compilation error with clang when `make W=1` and CONFIG_WERROR=y:

  kernel/locking/rwsem.c:187:20: error: unused function 'is_rwsem_reader_owned' [-Werror,-Wunused-function]
    187 | static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
        |                    ^~~~~~~~~~~~~~~~~~~~~
  kernel/locking/rwsem.c:271:35: error: unused function 'rwsem_owner' [-Werror,-Wunused-function]
    271 | static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
        |                                   ^~~~~~~~~~~

Fix this by moving these two functions under the CONFIG_DEBUG_RWSEMS define. Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240909182905.161156-1-longman@redhat.com
2024-09-10jump_label: Fix static_key_slow_dec() yet againPeter Zijlstra
While commit 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()") fixed one problem, it created yet another, notably the following is now possible: slow_dec if (try_dec) // dec_not_one-ish, false // enabled == 1 slow_inc if (inc_not_disabled) // inc_not_zero-ish // enabled == 2 return guard((mutex)(&jump_label_mutex); if (atomic_cmpxchg(1,0)==1) // false, we're 2 slow_dec if (try-dec) // dec_not_one, true // enabled == 1 return else try_dec() // dec_not_one, false WARN Use dec_and_test instead of cmpxchg(), like it was prior to 83ab38ef0a0b. Add a few WARNs for the paranoid. Fixes: 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()") Reported-by: "Darrick J. Wong" <djwong@kernel.org> Tested-by: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10perf: Add PERF_EV_CAP_READ_SCOPEKan Liang
Usually, an event can be read from any CPU of the scope. It doesn't need to be read from the advertised CPU. Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with scope can be read from any active CPU in the scope. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240802151643.1691631-3-kan.liang@linux.intel.com
2024-09-10perf: Generic hotplug support for a PMU with a scopeKan Liang
The perf subsystem assumes that the counters of a PMU are per-CPU, so the user space tool reads a counter from each CPU in system-wide mode. However, many PMUs don't have a per-CPU counter. The counter is effective for a scope, e.g., a die or a socket. To address this, a cpumask is exposed by the kernel driver to restrict to one CPU that stands for a specific scope. In case the given CPU is removed, hotplug support has to be implemented for each such driver. The code to support the cpumask and hotplug is very similar:

 - Expose a cpumask in sysfs.
 - Pick up another CPU in the same scope if the given CPU is removed.
 - Invoke perf_pmu_migrate_context() to migrate to a new CPU.
 - In event init, always set the CPU in the cpumask to event->cpu.

Similar duplicated code is implemented for each such PMU driver. It would be good to introduce a generic infrastructure to avoid such duplication. Five popular scopes are implemented here: core, die, cluster, pkg, and system-wide. The scope can be set when a PMU is registered. If so, a "cpumask" is automatically exposed for the PMU. The "cpumask" comes from perf_online_<scope>_mask, which tracks the active CPUs for each scope. They are set when the first CPU of the scope comes online via the generic perf hotplug support. When a corresponding CPU is removed, perf_online_<scope>_mask is updated accordingly and the PMU will be moved to a new CPU from the same scope if possible. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240802151643.1691631-2-kan.liang@linux.intel.com
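A sketch of how a driver opts in; apart from .scope and PERF_PMU_SCOPE_DIE, the names and stubbed callbacks are illustrative:

  #include <linux/perf_event.h>

  static int  my_event_init(struct perf_event *event) { return 0; }
  static int  my_add(struct perf_event *event, int flags) { return 0; }
  static void my_del(struct perf_event *event, int flags) { }
  static void my_start(struct perf_event *event, int flags) { }
  static void my_stop(struct perf_event *event, int flags) { }
  static void my_read(struct perf_event *event) { }

  static struct pmu my_die_pmu = {
          .scope          = PERF_PMU_SCOPE_DIE,  /* cpumask + hotplug handled generically */
          .task_ctx_nr    = perf_invalid_context,
          .event_init     = my_event_init,
          .add            = my_add,
          .del            = my_del,
          .start          = my_start,
          .stop           = my_stop,
          .read           = my_read,
  };

  /* in the driver's probe path:
   *	perf_pmu_register(&my_die_pmu, "my_die_pmu", -1);
   */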
2024-09-10sched/debug: Fix the runnable tasks outputHuang Shijie
The current runnable tasks output looks like:

  runnable tasks:
   S            task   PID       tree-key   switches  prio    wait-time    sum-exec    sum-sleep
  -------------------------------------------------------------------------------------------------------------
   Ikworker/R-rcu_g     4     0.129049 E   0.620179   0.750000   0.002920   2 100   0.000000   0.002920   0.000000   0.000000   0 0 /
   Ikworker/R-sync_     5     0.125328 E   0.624147   0.750000   0.001840   2 100   0.000000   0.001840   0.000000   0.000000   0 0 /
   Ikworker/R-slub_     6     0.120835 E   0.628680   0.750000   0.001800   2 100   0.000000   0.001800   0.000000   0.000000   0 0 /
   Ikworker/R-netns     7     0.114294 E   0.634701   0.750000   0.002400   2 100   0.000000   0.002400   0.000000   0.000000   0 0 /
   I     kworker/0:1     9   508.781746 E 511.754666   3.000000 151.575240 224 120   0.000000 151.575240   0.000000   0.000000   0 0 /

Which is messy. Remove the duplicate printing of sum_exec_runtime and tidy up the layout to make it look like:

  runnable tasks:
   S            task   PID       vruntime   eligible    deadline      slice     sum-exec  switches  prio   wait-time   sum-sleep  sum-block  node  group-id  group-path
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   I     kworker/0:3  1698   295.001459 E 297.977619   3.000000    38.862920     9 120   0.000000   0.000000   0.000000   0 0 /
   I     kworker/0:4  1702   278.026303 E 281.026303   3.000000     9.918760     3 120   0.000000   0.000000   0.000000   0 0 /
   S  NetworkManager  2646     0.377936 E   2.598104   3.000000    98.535880   314 120   0.000000   0.000000   0.000000   0 0 /system.slice/NetworkManager.service
   S       virtqemud  2689     0.541016 E   2.440104   3.000000    50.967960    80 120   0.000000   0.000000   0.000000   0 0 /system.slice/virtqemud.service
   S   gsd-smartcard  3058    73.604144 E  76.475904   3.000000    74.033320    88 120   0.000000   0.000000   0.000000   0 0 /user.slice/user-42.slice/session-c1.scope

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240906053019.7874-1-shijie@os.amperecomputing.com
2024-09-10sched: Fix sched_delayed vs sched_corePeter Zijlstra
Completely analogous to commit dfa0a574cbc4 ("sched/uclamg: Handle delayed dequeue"), avoid double dequeue for the sched_core entries. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10kernel/sched: Fix util_est accounting for DELAY_DEQUEUEDietmar Eggemann
Remove delayed tasks from util_est even though they are runnable. Exclude delayed tasks which are (a) migrating between rqs or (b) in a SAVE/RESTORE dequeue/enqueue. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
2024-09-10kthread: Fix task state in kthread worker if being frozenChen Yu
When analyzing a kernel warning message, Peter pointed out that there is a race condition when the kworker is being frozen and falls into try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze(). Although the root cause is not related to freeze() [1], it is still worth fixing this issue ahead of time. One possible race scenario:

	CPU 0					CPU 1
	-----					-----
	// kthread_worker_fn
	set_current_state(TASK_INTERRUPTIBLE);
						suspend_freeze_processes()
						  freeze_processes
						    static_branch_inc(&freezer_active);
						  freeze_kernel_threads
						    pm_nosig_freezing = true;
	if (work) { //false
	  __set_current_state(TASK_RUNNING);
	} else if (!freezing(current)) //false, been frozen
						freezing():
						if (static_branch_unlikely(&freezer_active))
						  if (pm_nosig_freezing)
						    return true;
	  schedule()
	}

	// state is still TASK_INTERRUPTIBLE
	try_to_freeze()
	  might_sleep() <--- warning

Fix this by explicitly setting TASK_RUNNING before entering try_to_freeze(). Fixes: b56c0d8937e6 ("kthread: implement kthread_worker") Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
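The shape of the fix in kthread_worker_fn() (a sketch with added comments):

  if (work) {
          __set_current_state(TASK_RUNNING);
          /* ... run the work item ... */
  } else if (!freezing(current)) {
          schedule();
  } else {
          /*
           * Handle the case where current remains TASK_INTERRUPTIBLE.
           * try_to_freeze() expects current to be TASK_RUNNING.
           */
          __set_current_state(TASK_RUNNING);
  }

  try_to_freeze();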
2024-09-10sched/pelt: Use rq_clock_task() for hw_pressureChen Yu
commit 97450eb90965 ("sched/pelt: Remove shift of thermal clock") removed the decay_shift for hw_pressure. This commit uses the sched_clock_task() in sched_tick() while it replaces the sched_clock_task() with rq_clock_pelt() in __update_blocked_others(). This could bring inconsistence. One possible scenario I can think of is in ___update_load_sum(): u64 delta = now - sa->last_update_time 'now' could be calculated by rq_clock_pelt() from __update_blocked_others(), and last_update_time was calculated by rq_clock_task() previously from sched_tick(). Usually the former chases after the latter, it cause a very large 'delta' and brings unexpected behavior. Fixes: 97450eb90965 ("sched/pelt: Remove shift of thermal clock") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
2024-09-10sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.cVincent Guittot
Move the effective_cpu_util() and sched_cpu_util() functions into fair.c alongside the other utilization-related functions. No functional change. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
2024-09-10sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()Peter Zijlstra
Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an IPI without actually sending an interrupt. Even in cases where the IPI handler does not queue a task on the idle CPU, do_idle() will call __schedule() since need_resched() returns true in these cases. Introduce and use SM_IDLE to identify call to __schedule() from schedule_idle() and shorten the idle re-entry time by skipping pick_next_task() when nr_running is 0 and the previous task is the idle task. With the SM_IDLE fast-path, the time taken to complete a fixed set of IPIs using ipistorm improves noticeably. Following are the numbers from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and 3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled) running ipistorm between CPU8 and CPU16: cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1 ================================================================== Test : ipistorm (modified) Units : Normalized runtime Interpretation: Lower is better Statistic : AMean ======================= Intel Ice Lake Xeon ====================== kernel: time [pct imp] tip:sched/core 1.00 [baseline] tip:sched/core + SM_IDLE 0.80 [20.51%] ==================== 3rd Generation AMD EPYC ===================== kernel: time [pct imp] tip:sched/core 1.00 [baseline] tip:sched/core + SM_IDLE 0.90 [10.17%] ================================================================== [ kprateek: Commit message, SM_RTLOCK_WAIT fix ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
2024-09-10dma-mapping: add tracing for dma-mapping API callsSean Anderson
When debugging drivers, it can often be useful to trace when memory gets (un)mapped for DMA (and can be accessed by the device). Add some tracepoints for this purpose. Use u64 instead of phys_addr_t and dma_addr_t (and similarly %llx instead of %pa) because libtraceevent can't handle typedefs in all cases. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-09user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocationJinjie Ruan
Let kmemdup_array() take care of the multiplication and possible overflows. Link: https://lkml.kernel.org/r/20240828072340.1249310-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Li zeming <zeming@nfschina.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
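The conversion has this shape (a sketch against struct uid_gid_map's fields, assuming the kmemdup_array(src, count, element_size, gfp) argument order):

  -	forward = kmemdup(map->forward,
  -			  map->nr_extents * sizeof(struct uid_gid_extent),
  -			  GFP_KERNEL);
  +	forward = kmemdup_array(map->forward, map->nr_extents,
  +				sizeof(struct uid_gid_extent), GFP_KERNEL);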
2024-09-09sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()Tejun Heo
Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated - scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made. Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ - local DSQs, the global DSQ, and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which doesn't hold a rq lock, including BPF timers and SYSCALL programs. This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ: http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
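A sketch of a BPF scheduler using the new kfuncs during DSQ iteration; MY_DSQ_ID, target_cpu, and should_promote() are illustrative:

  struct task_struct *p;

  bpf_for_each(scx_dsq, p, MY_DSQ_ID, 0) {
          if (!should_promote(p))
                  continue;

          /* override the slice for the upcoming move only */
          scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER, 5000000);

          /* move the task straight to a CPU's local DSQ */
          if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                        SCX_DSQ_LOCAL_ON | target_cpu,
                                        SCX_ENQ_PREEMPT))
                  break;
  }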
2024-09-09sched_ext: Compact struct bpf_iter_scx_dsq_kernTejun Heo
struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was using 5 of them. We want to add two more u64 fields, but it's better if we do so while staying within bpf_iter_scx_dsq to maintain binary compatibility. The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits, freeing up 48 bits - 16 bits for flags, 32 bits for a u32 - for use by struct bpf_iter_scx_dsq_kern. This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field, reducing the struct size by a full u64. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()Tejun Heo
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq(). - Rename consume_local_task() to move_local_task_to_local_dsq() and remove task_unlink_from_dsq() and source DSQ unlocking from it. This is to make the migration code easier to reuse. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Move consume_local_task() upwardTejun Heo
So that the local case comes first and two CONFIG_SMP blocks can be merged. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()Tejun Heo
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into task_unlink_from_dsq(). Also move sanity check into it. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Reorder args for consume_local/remote_task()Tejun Heo
Reorder args for consistency in the order of: current_rq, p, src_[rq|dsq], dst_[rq|dsq]. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09sched_ext: Restructure dispatch_to_local_dsq()Tejun Heo
Now that there's nothing left after the big if block, flip the if condition and unindent the body. No functional changes intended. v2: Add BUG() to clarify control can't reach the end of dispatch_to_local_dsq() in UP kernels per David. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handlingTejun Heo
With the preceding update, the only return value which makes a meaningful difference is DTL_INVALID, for which one caller, finish_dispatch(), falls back to the global DSQ and the other, process_ddsp_deferred_locals(), doesn't do anything. It should always fall back to the global DSQ. Move the global DSQ fallback into dispatch_to_local_dsq() and remove the return value. v2: Patch title and description updated to reflect the behavior fix for process_ddsp_deferred_locals(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ONTejun Heo
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON. Instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move the SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate code in direct_dispatch() and dispatch_to_local_dsq(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Refactor consume_remote_task()Tejun Heo
The tricky p->scx.holding_cpu handling was split across the consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

 - All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with consolidated documentation.
 - move_task_to_local_dsq() now implements straightforward task migration, making it easier to use in other places.
 - dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The usage is updated accordingly. This makes the local and remote cases more symmetric.

No functional changes intended. v2: s/task_rq/src_rq/ for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocateTejun Heo
Sleepables don't need to be in their own kfunc set as each is tagged with KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked, indicating that the rq lock is not held, and relocate right above the "any" set. This will be used to add kfuncs that are allowed to be called from SYSCALL but not TRACING. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-09cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCUKinsey Ho
Patch series "Improve mem_cgroup_iter()", v4. Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability. [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/ This patch (of 5): Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos. The following changes in this patchset rely on this clarification for simplification in memcg iteration code. Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com Suggested-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09uprobes: use vm_special_mapping close() functionalitySven Schnelle
The following KASAN splat was shown:

  [ 44.505448] ==================================================================    20:37:27 [3421/145075]
  [ 44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
  [ 44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
  [ 44.505479]
  [ 44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
  [ 44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
  [ 44.505508] Call Trace:
  [ 44.505511] [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
  [ 44.505521] [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
  [ 44.505529] [<000b0324d2f5464c>] print_report+0x44/0x138
  [ 44.505536] [<000b0324d1383192>] kasan_report+0xc2/0x140
  [ 44.505543] [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
  [ 44.505550] [<000b0324d12c7978>] remove_vma+0x78/0x120
  [ 44.505557] [<000b0324d128a2c6>] exit_mmap+0x326/0x750
  [ 44.505563] [<000b0324d0ba655a>] __mmput+0x9a/0x370
  [ 44.505570] [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
  [ 44.505575] [<000b0324d0bc0228>] do_exit+0x548/0xd70
  [ 44.505580] [<000b0324d0bc1102>] do_group_exit+0x132/0x390
  [ 44.505586] [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
  [ 44.505592] [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
  [ 44.505599] [<000b0324d2f78434>] __do_syscall+0xa4/0x170
  [ 44.505606] [<000b0324d2f9454c>] system_call+0x74/0x98
  [ 44.505614]
  [ 44.505616] Allocated by task 1384:
  [ 44.505621] kasan_save_stack+0x40/0x70
  [ 44.505630] kasan_save_track+0x28/0x40
  [ 44.505636] __kasan_kmalloc+0xa0/0xc0
  [ 44.505642] __create_xol_area+0xfa/0x410
  [ 44.505648] get_xol_area+0xb0/0xf0
  [ 44.505652] uprobe_notify_resume+0x27a/0x470
  [ 44.505657] irqentry_exit_to_user_mode+0x15e/0x1d0
  [ 44.505664] pgm_check_handler+0x122/0x170
  [ 44.505670]
  [ 44.505672] Freed by task 1384:
  [ 44.505676] kasan_save_stack+0x40/0x70
  [ 44.505682] kasan_save_track+0x28/0x40
  [ 44.505687] kasan_save_free_info+0x4a/0x70
  [ 44.505693] __kasan_slab_free+0x5a/0x70
  [ 44.505698] kfree+0xe8/0x3f0
  [ 44.505704] __mmput+0x20/0x370
  [ 44.505709] exit_mm+0x240/0x340
  [ 44.505713] do_exit+0x548/0xd70
  [ 44.505718] do_group_exit+0x132/0x390
  [ 44.505722] __s390x_sys_exit_group+0x56/0x60
  [ 44.505727] do_syscall+0x2f6/0x430
  [ 44.505732] __do_syscall+0xa4/0x170
  [ 44.505738] system_call+0x74/0x98

The problem is that uprobe_clear_state() kfrees struct xol_area, which contains struct vm_special_mapping *xol_mapping. This one is passed to _install_special_mapping() in xol_add_vma(). __mmput() reads:

  static inline void __mmput(struct mm_struct *mm)
  {
          VM_BUG_ON(atomic_read(&mm->mm_users));
          uprobe_clear_state(mm);
          exit_aio(mm);
          ksm_exit(mm);
          khugepaged_exit(mm); /* must run before exit_mmap */
          exit_mmap(mm);
          ...
  }

So uprobe_clear_state() at the beginning frees the memory area containing the vm_special_mapping data, but exit_mmap() uses this address later via vma->vm_private_data (which was set in _install_special_mapping()). Fix this by moving uprobe_clear_state() to uprobes.c and using it as the close() callback.
[usama.anjum@collabora.com: remove unneeded condition] Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
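A sketch of the fix's shape - tying the cleanup to the VMA's lifetime instead of to __mmput() ordering (the wrapper name is illustrative, and fields such as .pages are omitted):

  static void uprobe_vma_close(const struct vm_special_mapping *sm,
                               struct vm_area_struct *vma)
  {
          /* called from remove_vma()/exit_mmap() while the xol_area
           * and its vm_special_mapping are still alive */
          uprobe_clear_state(vma->vm_mm);
  }

  static const struct vm_special_mapping xol_mapping = {
          .name  = "[uprobes]",
          .close = uprobe_vma_close,
  };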
2024-09-09sched_ext: Add missing static to scx_dump_dataTejun Heo
scx_dump_data is only used inside ext.c but doesn't have static. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
2024-09-09bpf: Fix error message on kfunc arg type mismatchMaxim Mikityanskiy
When "arg#%d expected pointer to ctx, but got %s" error is printed, both template parts actually point to the type of the argument, therefore, it will also say "but got PTR", regardless of what was the actual register type. Fix the message to print the register type in the second part of the template, change the existing test to adapt to the new format, and add a new test to test the case when arg is a pointer to context, but reg is a scalar. Fixes: 00b85860feb8 ("bpf: Rewrite kfunc argument handling") Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20240909133909.1315460-1-maxim@isovalent.com
2024-09-09tracing: Drop unused helper function to fix the buildAndy Shevchenko
A helper function is defined but not used. This, in particular, prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

  kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
   2229 | static inline int run_tracer_selftest(struct tracer *type)
        |                   ^~~~~~~~~~~~~~~~~~~

Fix this by dropping the unused function. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09tracing/osnoise: Fix build when timerlat is not enabledSteven Rostedt
To fix some critical section races, the interface_lock was added to a few locations. One of those locations was above where the interface_lock was declared, so the declaration was moved up before that usage. Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER ifdef block. As the interface_lock is used outside that config, this broke the build when CONFIG_OSNOISE_TRACER was enabled but CONFIG_TIMERLAT_TRACER was not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Helena Anna" <helena.anna.dubel@intel.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home Fixes: e6a53481da29 ("tracing/timerlat: Only clear timer if a kthread exists") Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>