path: root/kernel
2024-09-13  bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error  (Daniel Borkmann)

For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input arguments, zero the value for the case of an error as otherwise it could leak memory. For tracing, it is not needed given CAP_PERFMON can already read all kernel memory anyway, hence bpf_get_func_arg() and bpf_get_func_ret() are skipped here.

Also, the MTU helpers' mtu_len pointer value is being written but also read. Technically, the MEM_UNINIT should not be there in order to always force init. Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now implies two things: i) write into memory, ii) memory does not have to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory, ii) memory must be initialized. This means that for bpf_*_check_mtu() we would be re-adding the very issue we are trying to fix, that is, it would then be able to write back into things like .rodata BPF maps. Follow-up work will rework the MEM_UNINIT semantics such that the intent can be better expressed. For now just clear the *mtu_len on the error path, which can be lifted again later.

Fixes: 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
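For illustration, the fixed bpf_strtol() plausibly ends up shaped like this after the series; a hedged sketch based on the descriptions in this patch set, not the verbatim kernel source:

  BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
             s64 *, res)
  {
          long long _res;
          int err;

          *res = 0;       /* zeroed up front so an error path cannot leak
                           * whatever the output buffer previously held */
          err = __bpf_strtoll(buf, buf_len, flags, &_res);
          if (err < 0)
                  return err;
          *res = _res;
          return err;
  }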
2024-09-13bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged typesDaniel Borkmann
When checking malformed helper function signatures, also take other argument types into account aside from just ARG_PTR_TO_UNINIT_MEM. This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can be passed there, too. The func proto sanity check goes back to commit 435faee1aae9 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos which had more than just one MEM_UNINIT-tagged type as arguments. The reason more than one is currently not supported is as we mark stack slots with STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to allow uninitialized stack memory to be passed to helpers when they just write into the buffer. Probing for base type as well as MEM_UNINIT tagging ensures that other types do not get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}). Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13  bpf: Fix helper writes to read-only maps  (Daniel Borkmann)

Lonial found an issue that despite a user- and BPF-side frozen BPF map (like in the case of .rodata), it was still possible to write into it from the BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT} as arguments.

In check_func_arg(), when the argument is of the mentioned type, meta->raw_mode is never set. Later, in check_helper_mem_access(), under the case of PTR_TO_MAP_VALUE as register base type, BPF_READ is assumed for the subsequent call to check_map_access_type(), and given the BPF map is read-only it succeeds.

The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT when results are written into them as opposed to read out of them. The latter indicates that it's okay to pass a pointer to uninitialized memory as the memory is written to anyway. However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM, just with an additional alignment requirement. So it is better to just get rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the fixed size memory types. For this, add MEM_ALIGNED to additionally ensure alignment given these helpers write directly into the args via *<ptr> = val. The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>).

MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated argument types, since in !MEM_FIXED_SIZE cases the verifier does not know the buffer size a priori and therefore cannot blindly write *<ptr> = val.

Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
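For reference, the resulting func proto annotation for bpf_strtol() plausibly looks like the sketch below; the field values follow the description above and are not the verbatim patch:

  const struct bpf_func_proto bpf_strtol_proto = {
          .func           = bpf_strtol,
          .gpl_only       = false,
          .ret_type       = RET_INTEGER,
          .arg1_type      = ARG_PTR_TO_MEM | MEM_RDONLY,
          .arg2_type      = ARG_CONST_SIZE,
          .arg3_type      = ARG_ANYTHING,
          /* fixed-size, alignment-checked output buffer that is written,
           * not read, so uninitialized memory may be passed in */
          .arg4_type      = ARG_PTR_TO_FIXED_SIZE_MEM | MEM_UNINIT | MEM_ALIGNED,
          .arg4_size      = sizeof(s64),
  };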
2024-09-13  bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers  (Daniel Borkmann)

Both bpf_strtol() and bpf_strtoul() helpers passed a temporary "long long" respectively "unsigned long long" to __bpf_strtoll() / __bpf_strtoull(). Later, the result was checked for truncation via _res != ({unsigned,} long)_res as the destination buffer for the BPF helpers was of type {unsigned,} long, which is 32bit on 32bit architectures. Given the latter was a bug in the helper signatures where the destination buffer got adjusted to {s,u}64, the truncation check can now be removed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240913191754.13290-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bitDaniel Borkmann
The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit: The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long" and therefore always considered fixed 64bit no matter if 64 or 32bit underlying architecture. This contract breaks in case of the two mentioned helpers since their BPF_CALL definition for the helpers was added with {unsigned,}long *res. Meaning, the transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper) breaks here. Both helpers call __bpf_strtoll() with "long long" correctly, but later assigning the result into 32-bit "*(long *)" on 32bit architectures. From a BPF program point of view, this means upper bits will be seen as uninitialised. Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation. Now, changing also uapi/bpf.h helper documentation which generates bpf_helper_defs.h for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger compiler warnings (incompatible pointer types passing 'long *' to parameter of type '__s64 *' (aka 'long long *')) for existing BPF programs. Leaving the signatures as-is would be fine as from BPF program point of view it is still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying architectures. Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue. Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
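The signature change described above, sketched (hedged, not the verbatim diff):

  /* before: kernel-side "long", i.e. 32 bits on 32-bit architectures */
  BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
             long *, res)

  /* after: matches the always-64-bit BPF-side "long" */
  BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
             s64 *, res)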
2024-09-13  bpf: Fix a sdiv overflow issue  (Yonghong Song)

Zac Ecob reported a problem where a bpf program may cause a kernel crash with the following error [1]:

  Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI

The failure is due to the signed divide LLONG_MIN/-1, where LLONG_MIN equals -9,223,372,036,854,775,808. LLONG_MIN/-1 is supposed to give the positive number 9,223,372,036,854,775,808, but that is impossible since for a 64-bit system the maximum positive number is 9,223,372,036,854,775,807. On x86_64, LLONG_MIN/-1 causes a kernel exception. On arm64, the result of LLONG_MIN/-1 is LLONG_MIN.

Further investigation found that all of the following sdiv/smod cases may trigger an exception when a bpf program is running on an x86_64 platform:
  - LLONG_MIN/-1 for 64bit operation
  - INT_MIN/-1 for 32bit operation
  - LLONG_MIN%-1 for 64bit operation
  - INT_MIN%-1 for 32bit operation
where -1 can be an immediate or in a register.

On arm64, there are no exceptions:
  - LLONG_MIN/-1 = LLONG_MIN
  - INT_MIN/-1 = INT_MIN
  - LLONG_MIN%-1 = 0
  - INT_MIN%-1 = 0
where -1 can be an immediate or in a register.

Insn patching is needed to handle the above cases, and the patched code produces results aligned with the above arm64 results. Below is pseudo code handling the sdiv/smod exceptions, covering both divisor -1 and divisor 0, with the divisor stored in a register:

  sdiv:
      tmp = rX
      tmp += 1                     /* [-1, 0] -> [0, 1] */
      if tmp >(unsigned) 1 goto L2
      if tmp == 0 goto L1
      rY = 0
  L1: rY = -rY
      goto L3
  L2: rY /= rX
  L3:

  smod:
      tmp = rX
      tmp += 1                     /* [-1, 0] -> [0, 1] */
      if tmp >(unsigned) 1 goto L1
      if tmp == 1 (is64 ? goto L2 : goto L3)
      rY = 0
      goto L2
  L1: rY %= rX
  L2: goto L4                      // only when !is64
  L3: wY = wY                      // only when !is64
  L4:

[1] https://lore.kernel.org/bpf/tPJLTEh7S_DxFEqAI2Ji5MBSoZVg7_G-Py2iaZpAaWtM961fFTWtsnlzwvTbzBzaUzwQAoNATXKUlt0LZOFgnDcIyKCswAnAGdUF3LBrhGQ=@protonmail.com/

Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240913150326.1187788-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
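A small user-space C model of the patched semantics; this is an illustration only (the real fix is BPF instruction patching, and the function name here is hypothetical):

  #include <limits.h>

  /* Models the patched BPF sdiv semantics (aligned with arm64): divide
   * by 0 yields 0, and LLONG_MIN / -1 yields LLONG_MIN instead of
   * trapping the way a raw x86-64 idiv instruction does. */
  static long long bpf_emulated_sdiv(long long dividend, long long divisor)
  {
          if (divisor == 0)
                  return 0;
          if (divisor == -1)      /* -LLONG_MIN overflows; negate unsigned */
                  return (long long)(0ULL - (unsigned long long)dividend);
          return dividend / divisor;
  }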
2024-09-13  module: Refine kmemleak scanned areas  (Vincent Donnefort)

Commit ac3b43283923 ("module: replace module_layout with module_memory") introduced a set of memory regions for the module layout sharing the same attributes. However, it didn't update the kmemleak scanned areas, which are intended to limit the kmemleak scan to sections containing writable data. This means sections such as .text and .rodata are scanned by kmemleak. Refine the scanned areas for modules by limiting it to the MOD_TEXT and MOD_INIT_TEXT mod_mem regions.

CC: Song Liu <song@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-09-13  module: abort module loading when sysfs setup suffers errors  (Chunhui Li)

When insmoding a kernel module, if add_notes_attrs or add_sysfs_attrs fails (e.g., on a memory allocation failure), mod_sysfs_setup will still return success, but we can't access the user interface on an Android device. Make mod_sysfs_setup check the errors of add_notes_attrs and add_sysfs_attrs and abort loading on failure.

[mcgrof: the section stuff comes from linux history.git [0]]

Fixes: 3f7b0672086b ("Module section offsets in /sys/module") [0]
Fixes: 6d76013381ed ("Add /sys/module/name/notes")
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409010016.3XIFSmRA-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202409072018.qfEzZbO7-lkp@intel.com/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=3f7b0672086b97b2d7f322bdc289cbfa203f10ef [0]
Signed-off-by: Xion Wang <xion.wang@mediatek.com>
Signed-off-by: Chunhui Li <chunhui.li@mediatek.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-09-12  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-09-11

We've added 12 non-merge commits during the last 16 day(s) which contain a total of 20 files changed, 228 insertions(+), 30 deletions(-).

There's a minor merge conflict in drivers/net/netkit.c:
  00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
  d96608794889 ("netkit: Disable netpoll support")

The main changes are:

1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used to easily parse skbs in BPF programs attached to tracepoints, from Philo Lu.

2) Add a cond_resched() point in BPF's sock_hash_free() as there have been several syzbot soft lockup reports recently, from Eric Dumazet.

3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which got noticed when the zero copy ice driver started to use it, from Maciej Fijalkowski.

4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs up via netif_receive_skb_list() to better measure latencies, from Daniel Xu.

5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.

6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but instead gather the actual value via /proc/sys/net/core/max_skb_frags, also from Maciej Fijalkowski.
====================

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  sock_map: Add a cond_resched() in sock_hash_free()
  selftests/bpf: Expand skb dynptr selftests for tp_btf
  bpf: Allow bpf_dynptr_from_skb() for tp_btf
  tcp: Use skb__nullable in trace_tcp_send_reset
  selftests/bpf: Add test for __nullable suffix in tp_btf
  bpf: Support __nullable argument suffix for tp_btf
  bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
  selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
  xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
  tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
  bpf, sockmap: Correct spelling skmsg.c
  netkit: Disable netpoll support

Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12  bpf: convert bpf_token_create() to CLASS(fd, ...)  (Al Viro)

Keep file reference through the entire thing, don't bother with grabbing struct path reference and while we are at it, don't confuse the hell out of readers by random mix of path.dentry->d_sb and path.mnt->mnt_sb uses - these two are equal, so just put one of those into a local variable and use that.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
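A minimal sketch of the CLASS(fd, ...)-based shape described above, assuming the fd guard class from <linux/file.h>; details are approximate, not the verbatim patch:

  int bpf_token_create(union bpf_attr *attr)
  {
          CLASS(fd, f)(attr->token_create.bpffs_fd);
          struct super_block *sb;

          if (!fd_file(f))
                  return -EBADF;

          /* path.mnt->mnt_sb == path.dentry->d_sb; keep one local */
          sb = fd_file(f)->f_path.mnt->mnt_sb;

          /* ... validate sb and build the token; the file reference is
           * dropped automatically when f goes out of scope, covering
           * every return path ... */
          return 0;
  }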
2024-09-12  Merge tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds)

Pull workqueue fix from Tejun Heo:

 - A fix for a NULL worker->pool deref bug which can be triggered when a worker is created and then destroyed immediately

* tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Clear worker->pool in the worker thread context
2024-09-12  dma-mapping: reflow dma_supported  (Christoph Hellwig)

dma_supported has become too much spaghetti for my taste. Reflow it to remove the duplicate use_dma_iommu condition and make the main path more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
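A sketch of what the reflowed function plausibly looks like, based on the description above; not necessarily the exact committed code:

  static bool dma_supported(struct device *dev, u64 mask)
  {
          const struct dma_map_ops *ops = get_dma_ops(dev);

          if (use_dma_iommu(dev))         /* single IOMMU check up front */
                  return true;
          if (!ops)
                  return dma_direct_supported(dev, mask);
          if (!ops->dma_supported)
                  return true;
          return ops->dma_supported(dev, mask);
  }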
2024-09-12  uidgid: make sure we fit into one cacheline  (Christian Brauner)

When I expanded uidgid mappings I intended for a struct uid_gid_map to fit into a single cacheline on x86, as they tend to be pretty performance sensitive (idmapped mounts etc). But a 4 byte hole was added that brought it over 64 bytes. Fix that by moving the static extent array and the extent counter into a substruct.

C's type punning for unions guarantees that we can access ->nr_extents even if the last written to member wasn't within the same object. This is also what we rely on in struct_group() and friends. This of course relies on non-strict aliasing; strict aliasing is something we don't do anyway. From the C standard:

  99) If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning").

Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
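The layout described above, sketched; field names follow the existing struct, but the exact placement is an assumption:

  struct uid_gid_map {    /* 64 bytes -- one x86 cacheline */
          union {
                  struct {
                          struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
                          u32 nr_extents; /* sits in the union's tail,
                                           * closing the old 4-byte hole */
                  };
                  struct {
                          struct uid_gid_extent *forward;
                          struct uid_gid_extent *reverse;
                  };
          };
  };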
2024-09-12  dma-mapping: reliably inform about DMA support for IOMMU  (Leon Romanovsky)

If the DMA IOMMU path is going to be used, the appropriate check should return that DMA is supported.

Fixes: b5c58b2fdc42 ("dma-mapping: direct calls for dma-iommu")
Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano
Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-11  sched: Move update_other_load_avgs() to kernel/sched/pelt.c  (Tejun Heo)

96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") added update_other_load_avgs() in kernel/sched/syscalls.c right above effective_cpu_util(). This location didn't fit that well in the first place, and with 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") moving effective_cpu_util() to kernel/sched/fair.c, it looks even more out of place. Relocate the function to kernel/sched/pelt.c where all its callees are.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
2024-09-11  workqueue: Clear worker->pool in the worker thread context  (Lai Jiangshan)

Marc Hartmayer reported:

  [   23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
  [   23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
  [   23.133954] Fault in home space mode while using kernel ASCE.
  [   23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
  [   23.134207] Oops: 0004 ilc:2 [#1] SMP
  (snip)
  [   23.134516] Call Trace:
  [   23.134520]  [<0000024e326caf28>] worker_thread+0x48/0x430
  [   23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
  [   23.134528]  [<0000024e326d3a3e>] kthread+0x11e/0x130
  [   23.134533]  [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
  [   23.134536]  [<0000024e333fb37a>] ret_from_fork+0xa/0x38
  [   23.134552] Last Breaking-Event-Address:
  [   23.134553]  [<0000024e333f4c04>] mutex_unlock+0x24/0x30
  [   23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops

With debugging and analysis, worker_thread() accesses the nullified worker->pool when the newly created worker is destroyed before being woken up, in which case worker_thread() can see the result of detach_worker() resetting worker->pool to NULL at the beginning.

Move the code "worker->pool = NULL;" out of detach_worker() to fix the problem.

worker->pool had been designed to be constant for regular workers and changeable for the rescuer. To share the attaching/detaching code for regular and rescuer workers, and to avoid worker->pool being accessed inadvertently when the worker has been detached, worker->pool is reset to NULL when detached no matter whether the worker is a rescuer or not.

To keep worker->pool being reset after detach, move the code "worker->pool = NULL;" into the worker thread context after detaching: it is either in the regular worker thread context after PF_WQ_WORKER is cleared, or in the rescuer worker thread context with wq_pool_attach_mutex held. So it is safe to do so.

Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: f4b7b53c94af ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11  bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing  (Yonghong Song)

Salvatore Benedetto reported an issue that when doing syscall tracepoint tracing the kernel stack is empty. For example, using the following command lines

  bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
  bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }'

the output for both commands is

  ===
  Kernel Stack
  ===

Further analysis shows that the pt_regs used for bpf syscall tracepoint tracing is the one constructed during the user->kernel transition. The call stack looks like

  perf_syscall_enter+0x88/0x7c0
  trace_sys_enter+0x41/0x80
  syscall_trace_enter+0x100/0x160
  do_syscall_64+0x38/0xf0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The ip address stored in pt_regs is from user space, hence no kernel stack is printed.

To fix the issue, a kernel address from pt_regs is required. In the kernel repo there are already a few cases like this. For example, in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) instances are used to supply the ip address or to use the ip address to construct the call stack.

Instead of allocating fake_regs on the stack, which may consume a lot of bytes, the function perf_trace_buf_alloc() in perf_syscall_{enter,exit}() is leveraged to create fake_regs, which will be passed to perf_call_bpf_{enter,exit}().

For the above bpftrace script, I got the following output with this patch. For tracepoint:syscalls:sys_enter_read:

  ===
  Kernel Stack
          syscall_trace_enter+407
          syscall_trace_enter+407
          do_syscall_64+74
          entry_SYSCALL_64_after_hwframe+75
  ===

and for tracepoint:syscalls:sys_exit_read:

  ===
  Kernel Stack
          syscall_exit_work+185
          syscall_exit_work+185
          syscall_exit_to_user_mode+305
          do_syscall_64+118
          entry_SYSCALL_64_after_hwframe+75
  ===

Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev
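The shape of the change inside perf_syscall_enter(), sketched from the description above; argument lists are approximate and not the verbatim patch:

  struct pt_regs *fake_regs;
  struct syscall_trace_enter *rec;

  /* perf_trace_buf_alloc() hands back a scratch pt_regs along with the
   * trace buffer, so no large pt_regs needs to live on the stack */
  rec = perf_trace_buf_alloc(size, &fake_regs, &rctx);
  if (!rec)
          return;

  /* fill the scratch regs with kernel-side state so kstack()-style
   * helpers see syscall_trace_enter()/do_syscall_64() rather than a
   * user-space IP */
  perf_fetch_caller_regs(fake_regs);
  perf_call_bpf_enter(sys_data->enter_event, fake_regs, sys_data, rec);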
2024-09-11  bpf: Check percpu map value size first  (Tao Chen)

Percpu maps are often used, but the map value size limit is often ignored, as in issue https://github.com/iovisor/bcc/issues/2519. The percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we can first check whether the value size exceeds PCPU_MIN_UNIT_SIZE, like the percpu map of local_storage does. The resulting error message is clearer than "cannot allocate memory".

Signed-off-by: Jinke Han <jinkehan@didiglobal.com>
Signed-off-by: Tao Chen <chen.dylane@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com
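The check described above, sketched; exact placement in the map's alloc-check path is assumed:

  /* before any memory is allocated: PCPU_MIN_UNIT_SIZE bounds what a
   * single percpu allocation can hold, so reject oversized values with
   * a clearer error than a later -ENOMEM */
  if (attr->value_size > PCPU_MIN_UNIT_SIZE)
          return -E2BIG;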
2024-09-11  Merge branch 'tip/sched/core' into sched_ext/for-6.12  (Tejun Heo)

Pull in tip/sched/core to resolve two merge conflicts:

- 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()")
  5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c")

  A simple context conflict. The former added __update_blocked_others() in the same #ifdef CONFIG_SMP block that effective_cpu_util() and sched_cpu_util() are in and the latter moved those functions to fair.c. This makes __update_blocked_others() more out of place. Will follow up with a patch to relocate.

- 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()")
  84d265281d6c ("sched/pelt: Use rq_clock_task() for hw_pressure")

  The former factored out the body of __update_blocked_others() into update_other_load_avgs(). The latter changed how update_hw_load_avg() is called in the body. Resolved by applying the change to update_other_load_avgs() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11  Merge tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux  (Linus Torvalds)

Pull printk fix from Petr Mladek:

 - Fix build of serial_core as a module

* tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Export match_devname_and_update_preferred_console()
2024-09-11  kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansion  (Baoquan He)

Running "make tags" always produces the annoying warnings below:

  ctags: Warning: kernel/workqueue.c:470: null expansion of name pattern "\1"
  ctags: Warning: kernel/workqueue.c:474: null expansion of name pattern "\1"
  ctags: Warning: kernel/workqueue.c:478: null expansion of name pattern "\1"

In commit 25528213fe9f ("tags: Fix DEFINE_PER_CPU expansions"), code in several places was adjusted, including the cpu_worker_pools definition. I noticed that in commit 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets"), the cpu_worker_pools definition was unfolded back. Not sure whether that was intentional or an oversight. Make a change to mute these warnings specifically.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11  Merge branches 'pm-sleep', 'pm-opp' and 'pm-tools'  (Rafael J. Wysocki)

Merge updates related to system sleep, operating performance points (OPP), and PM tooling for 6.12-rc1:

 - Remove unused stub for saveable_highmem_page() and remove deprecated macros from power management documentation (Andy Shevchenko).

 - Use sysfs_emit() and sysfs_emit_at() in "show" functions in the PM sysfs interface (Xueqin Luo).

 - Update the maintainers information for the operating-points-v2-ti-cpu DT binding (Dhruva Gole).

 - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring).

 - Update directory handling and installation process in the pm-graph Makefile and add .gitignore to ignore sleepgraph.py artifacts to pm-graph (Amit Vadhavana, Yo-Jung Lin).

 - Make cpupower display residency value in idle-info (Aboorva Devarajan).

 - Add missing powercap_set_enabled() stub function to cpupower (John B. Wyatt IV).

 - Add SWIG support to cpupower (John B. Wyatt IV).

* pm-sleep:
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  Documentation: PM: Discourage use of deprecated macros
  PM: sleep: Use sysfs_emit() and sysfs_emit_at() in "show" functions
  PM: hibernate: Use sysfs_emit() and sysfs_emit_at() in "show" functions

* pm-opp:
  dt-bindings: opp: operating-points-v2-ti-cpu: Update maintainers
  opp: ti: Drop unnecessary of_match_ptr()

* pm-tools:
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  pm-graph: Update directory handling and installation process in Makefile
  pm-graph: Make git ignore sleepgraph.py artifacts
  tools/cpupower: display residency value in idle-info
2024-09-11  bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpers  (Andrii Nakryiko)

Add sleepable implementations of the bpf_get_stack() and bpf_get_task_stack() helpers and allow them to be used from sleepable BPF programs (e.g., sleepable uprobes).

Note, the stack trace IP capturing itself is not sleepable (that would need to be a separate project), only the build ID fetching is sleepable and thus more reliable, as it will wait for data to be paged in, if necessary. For that we make use of the sleepable build_id_parse() implementation.

Now that the build ID related internals in kernel/bpf/stackmap.c can be used both in sleepable and non-sleepable contexts, we need to add additional rcu_read_lock()/rcu_read_unlock() protection around fetching perf_callchain_entry, but with the refactoring in the previous commit it's now pretty straightforward. We make sure to do rcu_read_unlock() (in sleepable mode only) right before the stack_map_get_build_id_offset() call, which can sleep. By that time we don't have any more use of perf_callchain_entry.

Note, bpf_get_task_stack() will fail for user mode if task != current, and for kernel mode build IDs are irrelevant. So in that sense adding a sleepable bpf_get_task_stack() implementation is a no-op. It feels right to wire this up for symmetry and completeness, but I'm open to just dropping it until we support the `user && crosstask` condition.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240829174232.3133883-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
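From the BPF side, usage plausibly looks like the sketch below; the section name, function name and buffer size are illustrative:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct bpf_stack_build_id stack[32];

  SEC("uprobe.s")         /* the ".s" suffix marks the uprobe sleepable */
  int BPF_UPROBE(profile_func)
  {
          /* sleepable context: build ID parsing may fault pages in,
           * making BPF_F_USER_BUILD_ID results far more reliable */
          bpf_get_stack(ctx, stack, sizeof(stack),
                        BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";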
2024-09-11  bpf: decouple stack_map_get_build_id_offset() from perf_callchain_entry  (Andrii Nakryiko)

Change stack_map_get_build_id_offset(), which is used to convert stack trace IP addresses into build ID+offset pairs. Right now this function accepts an array of u64s as input and uses an array of struct bpf_stack_build_id as output. This is problematic because the u64 array comes from perf_callchain_entry, which is (non-sleepable) RCU protected, so once we allow sleepable build ID fetching, this all breaks down.

But it's actually pretty easy to make stack_map_get_build_id_offset() work with an array of struct bpf_stack_build_id as both input and output. Which is what this patch is doing, eliminating the dependency on perf_callchain_entry. We require the caller to fill out the bpf_stack_build_id.ip fields (all others can be left uninitialized), and update in place as we do the build ID resolution.

We make sure to READ_ONCE() and cache locally the current IP value, as we use it in a few places to find the matching VMA and so on. Given this data is directly accessible and modifiable by the user's BPF code, we should make sure we have a consistent view of it.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240829174232.3133883-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11  lib/buildid: rename build_id_parse() into build_id_parse_nofault()  (Andrii Nakryiko)

Make it clear that build_id_parse() assumes that it can take no page fault by renaming it, and its current few users, to build_id_parse_nofault().

Also add a build_id_parse() stub which for now falls back to the non-sleepable implementation, but will be changed in subsequent patches to take advantage of sleepable context. The PROCMAP_QUERY ioctl() on the /proc/<pid>/maps file is using build_id_parse() and will automatically take advantage of the more reliable sleepable-context implementation.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240829174232.3133883-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11  bpf: Support __nullable argument suffix for tp_btf  (Philo Lu)

Pointers passed to tp_btf were trusted to be valid, but some tracepoints do take a NULL pointer as input, such as trace_tcp_send_reset(). The resulting invalid memory access cannot be detected by the verifier.

This patch fixes it by adding a suffix "__nullable" to the unreliable argument. The suffix is shown in BTF, and PTR_MAYBE_NULL will be added to nullable arguments. Users must then check the pointer before using it.

A problem here is that we use "btf_trace_##call" to search for the func_proto. As it is a typedef, argument names as well as the suffix are not recorded. To solve this, I use bpf_raw_event_map to find "__bpf_trace_##template" from "btf_trace_##call", and then we can see the suffix.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240911033719.91468-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
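A sketch of what a tp_btf program looks like against a nullable argument; the tracepoint's argument list is abbreviated and the example is illustrative only:

  SEC("tp_btf/tcp_send_reset")
  int BPF_PROG(on_tcp_send_reset, const struct sock *sk,
               const struct sk_buff *skb)
  {
          /* skb is declared skb__nullable on the kernel side, so the
           * verifier marks it PTR_MAYBE_NULL; dereferencing it without
           * this check is rejected at load time */
          if (!skb)
                  return 0;
          bpf_printk("tcp reset, skb len %u", skb->len);
          return 0;
  }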
2024-09-11  bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv  (Daniel Xu)

cpumap takes RX processing out of softirq and onto a separate kthread. Since the kthread needs to be scheduled in order to run (versus softirq which does not), we can theoretically experience extra latency if the system is under load and the scheduler is being unfair to us.

Moving the tracepoint to before passing the skb list up the stack allows users to more accurately measure enqueue/dequeue latency introduced by cpumap via the xdp:xdp_cpumap_enqueue and xdp:xdp_cpumap_kthread tracepoints.

f9419f7bd7a5 ("bpf: cpumap add tracepoints"), which added the tracepoints, states that the intent behind them was for general observability and for a feedback loop to see if the queues are being overwhelmed. This change does not mess with either of those use cases but rather adds a third one.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/bpf/47615d5b5e302e4bd30220473779e98b492d47cd.1725585718.git.dxu@dxuuu.xyz
2024-09-11  sched/cpufreq: Use NSEC_PER_MSEC for deadline task  (Christian Loehle)

Convert the sugov deadline task attributes to use the available definitions to make them more readable. No functional change.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20240813144348.1180344-5-christian.loehle@arm.com
2024-09-11  Merge branch 'for-6.11-fixup' into for-linus  (Petr Mladek)
2024-09-11  Merge v6.11-rc7 into drm-next  (Simona Vetter)

Thomas needs 5a498d4d06d6 ("drm/fbdev-dma: Only install deferred I/O if necessary") in drm-misc, so start the backmerge cascade.

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
2024-09-10  sched_ext: Don't trigger ops.quiescent/runnable() on migrations  (Tejun Heo)

A task moving across CPUs should not trigger quiescent/runnable task state events as the task is staying runnable the whole time and just stopping and then starting on different CPUs. Suppress quiescent/runnable task state events if task_on_rq_migrating().

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10  sched_ext: Synchronize bypass state changes with rq lock  (Tejun Heo)

While the BPF scheduler is being unloaded, the following warning messages trigger sometimes:

  NOHZ tick-stop error: local softirq work is pending, handler #80!!!

This is caused by the CPU entering idle while there are pending softirqs. The main culprit is the bypassing state assertion not being synchronized with rq operations. As the BPF scheduler cannot be trusted in the disable path, the first step is entering the bypass mode where the BPF scheduler is ignored and scheduling becomes global FIFO. This is implemented by turning scx_ops_bypassing() true. However, the transition isn't synchronized against anything and it's possible for the enqueue and dispatch paths to have different ideas on whether bypass mode is on.

Make each rq track its own bypass state with SCX_RQ_BYPASSING, which is modified while the rq is locked.

This removes most of the NOHZ tick-stop messages but not completely. I believe the stragglers are from the sched core bug where pick_task_scx() can be called without a preceding balance_scx(). Once that bug is fixed, we should verify that all occurrences of this error message are gone too.

v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that the per-cpu states are always synchronized with the global state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>
2024-09-10  cgroup: Do not report unavailable v1 controllers in /proc/cgroups  (Michal Koutný)

This is a followup to the CONFIG-urability of the cpuset and memory controllers for v1 hierarchies. Make the output in /proc/cgroups reflect that !CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and !CONFIG_MEMCG_V1 is like !CONFIG_MEMCG. The intended effect is that hiding the unavailable controllers will hint users not to try mounting them on v1.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10  cgroup: Disallow mounting v1 hierarchies without controller implementation  (Michal Koutný)

The configs that disable some v1 controllers would still allow mounting them, but with no controller-specific files. (Making such hierarchies equivalent to named v1 hierarchies.) To achieve behavior consistent with actual out-compilation of a whole controller, the mounts should treat respective controllers as non-existent.

Wrap the implementation into a helper function and leverage legacy_files to detect compiled-out controllers. The effect is that mounts on v1 would fail and produce a message like:

  [ 1543.999081] cgroup: Unknown subsys name 'memory'

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10  cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only  (Michal Koutný)

The cpuset filesystem is a legacy interface to the cpuset controller with (pre-)v1 features. It makes little sense to co-mount it on systems without cpuset v1, so do not build it when cpuset v1 is not built either.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10  PM: hibernate: Remove unused stub for saveable_highmem_page()  (Andy Shevchenko)

When saveable_highmem_page() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

  kernel/power/snapshot.c:1369:21: error: unused function 'saveable_highmem_page' [-Werror,-Wunused-function]
   1369 | static inline void *saveable_highmem_page(struct zone *z, unsigned long p)
        |                     ^~~~~~~~~~~~~~~~~~~~~

Fix this by removing the unused stub.

See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build").

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20240905184848.318978-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-10  Merge tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)

Pull tracing fixes from Steven Rostedt:

 - Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER

   The fix to some locking races moved the declaration of the interface_lock up in the file, but also moved it into the CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when that wasn't set. Move it further up and out of that #ifdef block.

 - Remove unused function run_tracer_selftest() stub

   When CONFIG_FTRACE_STARTUP_TEST is not set, the stub function run_tracer_selftest() is not used and clang is warning about it. Remove the function stub as it is not needed.

* tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drop unused helper function to fix the build
  tracing/osnoise: Fix build when timerlat is not enabled
2024-09-10  ntp: Make sure RTC is synchronized when time goes backwards  (Benjamin ROBIN)

sync_hw_clock() is normally called every 11 minutes when time is synchronized. The issue is that this periodic timer uses the REALTIME clock, so when time moves backwards (the NTP server jumps into the past), the timer expires late. If the timer expires late, which can be days later, the RTC will no longer be updated, which is an issue if the device is abruptly powered off during this period. When the device restarts (when powered on), it will have the date prior to the ADJ_SETOFFSET call.

A normal NTP server should not jump into the past like that, but it is possible... Another way of reproducing this issue is to use phc2sys to synchronize the REALTIME clock with, for example, an IRIG timecode with a source that always starts at the same date (not synchronized).

Also, if the time jumps into the future by less than 11 minutes, the RTC may not be updated immediately (a minor issue). Consider the following scenario:
 - Time is synchronized, and sync_hw_clock() was just called (the timer expires in 11 minutes).
 - The time jumps a couple of minutes into the future.
 - The time is synchronized again.
 - Users may expect the RTC to be updated as soon as possible, not after 11 minutes (for the same reason, if a power loss occurs in this period).

Cancel the periodic timer on any time jump (ADJ_SETOFFSET) greater than or equal to 1s. The timer will be relaunched at the end of do_adjtimex() if NTP is still considered synced; otherwise the timer will be relaunched later when NTP is synced. This way, when the time is synchronized again, the RTC is updated after less than 2 seconds.

Signed-off-by: Benjamin ROBIN <dev@benjarobin.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240908140836.203911-1-dev@benjarobin.fr
2024-09-10  Merge branch 'linus' into timers/core  (Thomas Gleixner)

To update with the latest fixes.
2024-09-10  locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS  (Waiman Long)

Both is_rwsem_reader_owned() and rwsem_owner() are currently only used when CONFIG_DEBUG_RWSEMS is defined. This causes a compilation error with clang when `make W=1` and CONFIG_WERROR=y:

  kernel/locking/rwsem.c:187:20: error: unused function 'is_rwsem_reader_owned' [-Werror,-Wunused-function]
    187 | static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
        |                    ^~~~~~~~~~~~~~~~~~~~~
  kernel/locking/rwsem.c:271:35: error: unused function 'rwsem_owner' [-Werror,-Wunused-function]
    271 | static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
        |                                   ^~~~~~~~~~~

Fix this by moving these two functions under the CONFIG_DEBUG_RWSEMS define.

Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240909182905.161156-1-longman@redhat.com
2024-09-10  jump_label: Fix static_key_slow_dec() yet again  (Peter Zijlstra)

While commit 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()") fixed one problem, it created yet another; notably the following is now possible:

  slow_dec
    if (try_dec) // dec_not_one-ish, false
    // enabled == 1
                                slow_inc
                                  if (inc_not_disabled) // inc_not_zero-ish
                                  // enabled == 2
                                  return

    guard(mutex)(&jump_label_mutex);
    if (atomic_cmpxchg(1,0)==1) // false, we're 2

                                slow_dec
                                  if (try-dec) // dec_not_one, true
                                  // enabled == 1
                                  return

    else
      try_dec() // dec_not_one, false
      WARN

Use dec_and_test instead of cmpxchg(), like it was prior to 83ab38ef0a0b. Add a few WARNs for the paranoid.

Fixes: 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Tested-by: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
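The fixed slow path plausibly ends up shaped like the sketch below, per the description above; this is not the verbatim patch:

  static void __static_key_slow_dec_cpuslocked(struct static_key *key)
  {
          lockdep_assert_cpus_held();

          if (static_key_slow_try_dec(key))
                  return;

          guard(mutex)(&jump_label_mutex);
          /* dec_and_test instead of cmpxchg(1, 0): still correct when a
           * racing slow_inc has meanwhile bumped enabled past 1 */
          if (atomic_dec_and_test(&key->enabled))
                  jump_label_update(key);
  }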
2024-09-10  perf: Add PERF_EV_CAP_READ_SCOPE  (Kan Liang)

Usually, an event can be read from any CPU of the scope. It doesn't need to be read from the advertised CPU. Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with scope can be read from any active CPU in the scope.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-3-kan.liang@linux.intel.com
2024-09-10  perf: Generic hotplug support for a PMU with a scope  (Kan Liang)

The perf subsystem assumes that the counters of a PMU are per-CPU. So the user space tool reads a counter from each CPU in the system-wide mode. However, many PMUs don't have a per-CPU counter. The counter is effective for a scope, e.g., a die or a socket. To address this, a cpumask is exposed by the kernel driver to restrict to one CPU to stand for a specific scope. In case the given CPU is removed, the hotplug support has to be implemented for each such driver.

The code to support the cpumask and hotplug is very similar:
 - Expose a cpumask into sysfs
 - Pick up another CPU in the same scope if the given CPU is removed
 - Invoke the perf_pmu_migrate_context() to migrate to a new CPU
 - In event init, always set the CPU in the cpumask to event->cpu

Similar duplicated code is implemented for each such PMU driver. It would be good to introduce a generic infrastructure to avoid such duplication.

Five popular scopes are implemented here: core, die, cluster, pkg, and system-wide. The scope can be set when a PMU is registered; see the sketch below. If so, a "cpumask" is automatically exposed for the PMU. The "cpumask" is from the perf_online_<scope>_mask, which is to track the active CPU for each scope. They are set when the first CPU of the scope is online via the generic perf hotplug support. When a corresponding CPU is removed, the perf_online_<scope>_mask is updated accordingly and the PMU will be moved to a new CPU from the same scope if possible.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-2-kan.liang@linux.intel.com
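From a driver's point of view, opting in plausibly looks like this hedged sketch; the my_* names are hypothetical:

  static struct pmu my_uncore_pmu = {
          /* ... the usual event_init/add/del/read callbacks ... */
          .scope = PERF_PMU_SCOPE_DIE,    /* counters are per-die */
  };

  /* With .scope set, the core exposes a "cpumask" attribute and
   * migrates the PMU context to another CPU of the same die on
   * hotplug; no per-driver hotplug callback is needed. */
  perf_pmu_register(&my_uncore_pmu, "my_uncore", -1);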
2024-09-10  sched/debug: Fix the runnable tasks output  (Huang Shijie)

The current runnable tasks output looks like:

  runnable tasks:
   S            task   PID       tree-key  switches  prio     wait-time        sum-exec       sum-sleep
  -------------------------------------------------------------------------------------------------------------
   Ikworker/R-rcu_g     4     0.129049 E   0.620179   0.750000     0.002920     2   100   0.000000     0.002920     0.000000     0.000000 0 0 /
   Ikworker/R-sync_     5     0.125328 E   0.624147   0.750000     0.001840     2   100   0.000000     0.001840     0.000000     0.000000 0 0 /
   Ikworker/R-slub_     6     0.120835 E   0.628680   0.750000     0.001800     2   100   0.000000     0.001800     0.000000     0.000000 0 0 /
   Ikworker/R-netns     7     0.114294 E   0.634701   0.750000     0.002400     2   100   0.000000     0.002400     0.000000     0.000000 0 0 /
   I    kworker/0:1     9   508.781746 E 511.754666   3.000000   151.575240   224   120   0.000000   151.575240     0.000000     0.000000 0 0 /

Which is messy. Remove the duplicate printing of sum_exec_runtime and tidy up the layout to make it look like:

  runnable tasks:
   S            task   PID       vruntime   eligible    deadline       slice      sum-exec  switches  prio    wait-time    sum-sleep    sum-block  node  group-id  group-path
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   I     kworker/0:3  1698   295.001459 E 297.977619    3.000000    38.862920     9   120     0.000000     0.000000     0.000000   0   0   /
   I     kworker/0:4  1702   278.026303 E 281.026303    3.000000     9.918760     3   120     0.000000     0.000000     0.000000   0   0   /
   S  NetworkManager  2646     0.377936 E   2.598104    3.000000    98.535880   314   120     0.000000     0.000000     0.000000   0   0   /system.slice/NetworkManager.service
   S       virtqemud  2689     0.541016 E   2.440104    3.000000    50.967960    80   120     0.000000     0.000000     0.000000   0   0   /system.slice/virtqemud.service
   S   gsd-smartcard  3058    73.604144 E  76.475904    3.000000    74.033320    88   120     0.000000     0.000000     0.000000   0   0   /user.slice/user-42.slice/session-c1.scope

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240906053019.7874-1-shijie@os.amperecomputing.com
2024-09-10  sched: Fix sched_delayed vs sched_core  (Peter Zijlstra)

Completely analogous to commit dfa0a574cbc4 ("sched/uclamg: Handle delayed dequeue"), avoid double dequeue for the sched_core entries.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10  kernel/sched: Fix util_est accounting for DELAY_DEQUEUE  (Dietmar Eggemann)

Remove delayed tasks from util_est even if they are runnable. Exclude delayed tasks which are (a) migrating between rqs or (b) in a SAVE/RESTORE dequeue/enqueue.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
2024-09-10  kthread: Fix task state in kthread worker if being frozen  (Chen Yu)

When analyzing a kernel warning message, Peter pointed out that there is a race condition when the kworker is being frozen and falls into try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze(). Although the root cause is not related to freeze() [1], it is still worth fixing this issue ahead of time.

One possible race scenario:

  CPU 0                                     CPU 1
  -----                                     -----
  // kthread_worker_fn
  set_current_state(TASK_INTERRUPTIBLE);
                                            suspend_freeze_processes()
                                              freeze_processes
                                                static_branch_inc(&freezer_active);
                                              freeze_kernel_threads
                                                pm_nosig_freezing = true;
  if (work) { //false
    __set_current_state(TASK_RUNNING);
  } else if (!freezing(current)) //false, been frozen
    // freezing():
    //   if (static_branch_unlikely(&freezer_active))
    //     if (pm_nosig_freezing)
    //       return true;
    schedule()
  }

  // state is still TASK_INTERRUPTIBLE
  try_to_freeze()
    might_sleep() <--- warning

Fix this by explicitly setting TASK_RUNNING before entering try_to_freeze().

Fixes: b56c0d8937e6 ("kthread: implement kthread_worker")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
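The fix described above, sketched in the context of kthread_worker_fn(); hedged, not the verbatim diff:

  } else if (!freezing(current)) {
          schedule();
  } else {
          /* Stay TASK_RUNNING: try_to_freeze() may sleep, and
           * sleeping in TASK_INTERRUPTIBLE trips might_sleep(). */
          __set_current_state(TASK_RUNNING);
  }

  try_to_freeze();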
2024-09-10  sched/pelt: Use rq_clock_task() for hw_pressure  (Chen Yu)

Commit 97450eb90965 ("sched/pelt: Remove shift of thermal clock") removed the decay_shift for hw_pressure. That commit uses sched_clock_task() in sched_tick() while it replaces sched_clock_task() with rq_clock_pelt() in __update_blocked_others(). This could bring inconsistency. One possible scenario I can think of is in ___update_load_sum():

  u64 delta = now - sa->last_update_time

'now' could be calculated by rq_clock_pelt() from __update_blocked_others(), while last_update_time was calculated by rq_clock_task() previously from sched_tick(). Usually the former chases after the latter, so 'now' can be smaller than 'last_update_time', which causes a very large (underflowed) 'delta' and brings unexpected behavior.

Fixes: 97450eb90965 ("sched/pelt: Remove shift of thermal clock")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
2024-09-10  sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c  (Vincent Guittot)

Move the effective_cpu_util() and sched_cpu_util() functions into fair.c, with the other utilization-related functions. No functional change.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
2024-09-10  sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()  (Peter Zijlstra)

Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG mode can be pulled out of idle by setting the TIF_NEED_RESCHED flag to service an IPI without actually sending an interrupt. Even in cases where the IPI handler does not queue a task on the idle CPU, do_idle() will call __schedule() since need_resched() returns true in these cases.

Introduce and use SM_IDLE to identify a call to __schedule() from schedule_idle(), and shorten the idle re-entry time by skipping pick_next_task() when nr_running is 0 and the previous task is the idle task.

With the SM_IDLE fast-path, the time taken to complete a fixed set of IPIs using ipistorm improves noticeably. Following are the numbers from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and a 3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled) running ipistorm between CPU8 and CPU16:

  cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

  ==================================================================
  Test          : ipistorm (modified)
  Units         : Normalized runtime
  Interpretation: Lower is better
  Statistic     : AMean
  ======================= Intel Ice Lake Xeon ======================
  kernel:                       time [pct imp]
  tip:sched/core                1.00 [baseline]
  tip:sched/core + SM_IDLE      0.80 [20.51%]
  ==================== 3rd Generation AMD EPYC =====================
  kernel:                       time [pct imp]
  tip:sched/core                1.00 [baseline]
  tip:sched/core + SM_IDLE      0.90 [10.17%]
  ==================================================================

[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
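The fast path described above, sketched; hedged, as the exact condition and label names may differ in the committed code:

  /* early in __schedule(): */
  if (sched_mode == SM_IDLE && !rq->nr_running) {
          next = prev;            /* keep running the idle task */
          goto picked;            /* skip prev-requeue and pick_next_task() */
  }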