summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-05-23traceevent/block: Add REQ_ATOMIC flag to block trace eventsRitesh Harjani (IBM)
Filesystems like XFS can implement atomic write I/O using either REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful if we have a flag in trace events to distinguish between the two. This patch adds char 'U' (Untorn writes) to rwbs field of the trace events if REQ_ATOMIC flag is set in the bio. <W/ REQ_ATOMIC> ================= xfs_io-4238 [009] ..... 4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io] <idle>-0 [009] d.h1. 4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0] <W/O REQ_ATOMIC> =============== xfs_io-4237 [010] ..... 4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io] <idle>-0 [010] d.H1. 4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23xen: enable XEN_UNPOPULATED_ALLOC as part of xen.configRoger Pau Monne
PVH dom0 is useless without XEN_UNPOPULATED_ALLOC, as otherwise it will very likely balloon out all dom0 memory to map foreign and grant pages. Enable it by default as part of xen.config. This also requires enabling MEMORY_HOTREMOVE and ZONE_DEVICE. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20250514092037.28970-1-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-05-22mm: rename try_alloc_pages() to alloc_pages_nolock()Alexei Starovoitov
The "try_" prefix is confusing, since it made people believe that try_alloc_pages() is analogous to spin_trylock() and NULL return means EAGAIN. This is not the case. If it returns NULL there is no reason to call it again. It will most likely return NULL again. Hence rename it to alloc_pages_nolock() to make it symmetrical to free_pages_nolock() and document that NULL means ENOMEM. Link: https://lkml.kernel.org/r/20250517003446.60260-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22sched_ext: Call ops.update_idle() after updating builtin idle bitsTejun Heo
BPF schedulers that use both builtin CPU idle mechanism and ops.update_idle() may want to use the latter to create interlocking between ops.enqueue() and CPU idle transitions so that either ops.enqueue() sees the idle bit or ops.update_idle() sees the task queued somewhere. This can prevent race conditions where CPUs go idle while tasks are waiting in DSQs. For such interlocking to work, ops.update_idle() must be called after builtin CPU masks are updated. Relocate the invocation. Currently, there are no ordering requirements on transitions from idle and this relocation isn't expected to make meaningful differences in that direction. This also makes the ops.update_idle() behavior semantically consistent: any action performed in this callback should be able to override the builtin idle state, not the other way around. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-and-tested-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-05-22sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifierTejun Heo
Replace explicit cgroup_bpf_inherit/offline() calls from cgroup creation/destruction paths with notification callback registered on cgroup_lifetime_notifier. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-22sched_ext: Introduce cgroup_lifetime_notifierTejun Heo
Other subsystems may make use of the cgroup hierarchy with the cgroup_bpf support being one such example. For such a feature, it's useful to be able to hook into cgroup creation and destruction paths to perform feature-specific initializations and cleanups. Add cgroup_lifetime_notifier which generates CGROUP_LIFETIME_ONLINE and CGROUP_LIFETIME_OFFLINE events whenever cgroups are created and destroyed, respectively. The next patch will convert cgroup_bpf to use the new notifier and other uses are planned. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-22cgroup: Minor reorganization of cgroup_create()Tejun Heo
cgroup_bpf init and exit handling will be moved to a notifier chain. In prepartion, reorganize cgroup_create() a bit so that the new cgroup is fully initialized before any outside changes are made. - cgrp->ancestors[] initialization and the hierarchical nr_descendants and nr_frozen_descendants updates were in the same loop. Separate them out and do the former earlier and do the latter later. - Relocate cgroup_bpf_inherit() call so that it's after all cgroup initializations are complete. No visible behavior changes expected. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22bpf: Revert "bpf: remove unnecessary rcu_read_{lock,unlock}() in ↵Di Shen
multi-uprobe attach logic" This reverts commit 4a8f635a6054. Althought get_pid_task() internally already calls rcu_read_lock() and rcu_read_unlock(), the find_vpid() was not. The documentation for find_vpid() clearly states: "Must be called with the tasklist_lock or rcu_read_lock() held." Add proper rcu_read_lock/unlock() to protect the find_vpid(). Fixes: 4a8f635a6054 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic") Reported-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Di Shen <di.shen@unisoc.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-21cgroup: avoid per-cpu allocation of size zero rstat cpu locksJP Kobryn
Subsystem rstat locks are dynamically allocated per-cpu. It was discovered that a panic can occur during this allocation when the lock size is zero. This is the case on non-smp systems, since arch_spinlock_t is defined as an empty struct. Prevent this allocation when !CONFIG_SMP by adding a pre-processor conditional around the affected block. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reported-by: Klara Modin <klarasmodin@gmail.com> Fixes: 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-21kernel/panic.c: format kernel-doc commentsSravan Kumar Gundu
kernel-doc function comment don't follows documentation commenting style misinterpreting arguments description with function description. Please see latest docs generated before applying this patch https://docs.kernel.org/driver-api/basics.html#c.panic Link: https://lkml.kernel.org/r/20250516174031.2937-1-sravankumarlpu@gmail.com Signed-off-by: Sravan Kumar Gundu <sravankumarlpu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21fork: define a local GFP_VMAP_STACKLinus Walleij
The current allocation of VMAP stack memory is using (THREADINFO_GFP & ~__GFP_ACCOUNT) which is a complicated way of saying (GFP_KERNEL | __GFP_ZERO): <linux/thread_info.h>: define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO) <linux/gfp_types.h>: define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) This is an unfortunate side-effect of independent changes blurring the picture: commit 19809c2da28aee5860ad9a2eff760730a0710df0 changed (THREADINFO_GFP | __GFP_HIGHMEM) to just THREADINFO_GFP since highmem became implicit. commit 9b6f7e163cd0f468d1b9696b785659d3c27c8667 then added stack caching and rewrote the allocation to (THREADINFO_GFP & ~__GFP_ACCOUNT) as cached stacks need to be accounted separately. However that code, when it eventually accounts the memory does this: ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0) so the memory is charged as a GFP_KERNEL allocation. Define a unique GFP_VMAP_STACK to use GFP_KERNEL | __GFP_ZERO and move the comment there. Link: https://lkml.kernel.org/r/20250509-gfp-stack-v1-1-82f6f7efc210@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reported-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21fork: check charging success before zeroing stackPasha Tatashin
No need to do zero cached stack if memcg charge fails, so move the charging attempt before the memset operation. [linus.walleij@linaro.org: rebased] Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-3-e6c69dd356f2@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-6-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks codePasha Tatashin
There are two data types: "struct vm_struct" and "struct vm_stack" that have the same local variable names: vm_stack, or vm, or s, which makes the code confusing to read. Change the code so the naming is consistent: struct vm_struct is always called vm_area struct vm_stack is always called vm_stack One change altering vfree(vm_stack) to vfree(vm_area->addr) may look like a semantic change but it is not: vm_area->addr points to the vm_stack. This was done to improve readability. [linus.walleij@linaro.org: rebased and added new users of the variable names, address review comments] Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-2-e6c69dd356f2@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21fork: clean-up ifdef logic around stack allocationPasha Tatashin
Patch series "fork: Page operation cleanups in the fork code", v3. This patchset consists of outtakes from a 1 year+ old patchset from Pasha, which all stand on their own. See: https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/ These are good cleanups for readability so I split these off, rebased on v6.15-rc1, addressed review comments and send them separately. All mentions of dynamic stack are removed from the patchset as we have no idea whether that will go anywhere. This patch (of 3): There is unneeded OR in the ifdef functions that are used to allocate and free kernel stacks based on direct map or vmap. Therefore, clean up by changing the order so OR is no longer needed. [linus.walleij@linaro.org: rebased] Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-1-e6c69dd356f2@linaro.org Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-0-e6c69dd356f2@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21kernel/watchdog: add /sys/kernel/{hard,soft}lockup_countMax Kellermann
Patch series "sysfs: add counters for lockups and stalls", v2. Commits 9db89b411170 ("exit: Expose "oops_count" to sysfs") and 8b05aa263361 ("panic: Expose "warn_count" to sysfs") added counters for oopses and warnings to sysfs, and these two patches do the same for hard/soft lockups and RCU stalls. All of these counters are useful for monitoring tools to detect whether the machine is healthy. If the kernel has experienced a lockup or a stall, it's probably due to a kernel bug, and I'd like to detect that quickly and easily. There is currently no way to detect that, other than parsing dmesg. Or observing indirect effects: such as certain tasks not responding, but then I need to observe all tasks, and it may take a while until these effects become visible/measurable. I'd rather be able to detect the primary cause more quickly, possibly before everything falls apart. This patch (of 2): There is /proc/sys/kernel/hung_task_detect_count, /sys/kernel/warn_count and /sys/kernel/oops_count but there is no userspace-accessible counter for hard/soft lockups. Having this is useful for monitoring tools. Link: https://lkml.kernel.org/r/20250504180831.4190860-1-max.kellermann@ionos.com Link: https://lkml.kernel.org/r/20250504180831.4190860-2-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Cc: Cc: Core Minyard <cminyard@mvista.com> Cc: Doug Anderson <dianders@chromium.org> Cc: Joel Granados <joel.granados@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21crash_dump: retrieve dm crypt keys in kdump kernelCoiby Xu
Crash kernel will retrieve the dm crypt keys based on the dmcryptkeys command line parameter. When user space writes the key description to /sys/kernel/config/crash_dm_crypt_key/restore, the crash kernel will save the encryption keys to the user keyring. Then user space e.g. cryptsetup's --volume-key-keyring API can use it to unlock the encrypted device. Link: https://lkml.kernel.org/r/20250502011246.99238-6-coxu@redhat.com Signed-off-by: Coiby Xu <coxu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Daniel P. Berrange" <berrange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Jan Pazdziora <jpazdziora@redhat.com> Cc: Liu Pingfan <kernelfans@gmail.com> Cc: Milan Broz <gmazyland@gmail.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21crash_dump: reuse saved dm crypt keys for CPU/memory hot-pluggingCoiby Xu
When there are CPU and memory hot un/plugs, the dm crypt keys may need to be reloaded again depending on the solution for crash hotplug support. Currently, there are two solutions. One is to utilizes udev to instruct user space to reload the kdump kernel image and initrd, elfcorehdr and etc again. The other is to only update the elfcorehdr segment introduced in commit 247262756121 ("crash: add generic infrastructure for crash hotplug support"). For the 1st solution, the dm crypt keys need to be reloaded again. The user space can write true to /sys/kernel/config/crash_dm_crypt_key/reuse so the stored keys can be re-used. For the 2nd solution, the dm crypt keys don't need to be reloaded. Currently, only x86 supports the 2nd solution. If the 2nd solution gets extended to all arches, this patch can be dropped. Link: https://lkml.kernel.org/r/20250502011246.99238-5-coxu@redhat.com Signed-off-by: Coiby Xu <coxu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Daniel P. Berrange" <berrange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Jan Pazdziora <jpazdziora@redhat.com> Cc: Liu Pingfan <kernelfans@gmail.com> Cc: Milan Broz <gmazyland@gmail.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21crash_dump: store dm crypt keys in kdump reserved memoryCoiby Xu
When the kdump kernel image and initrd are loaded, the dm crypts keys will be read from keyring and then stored in kdump reserved memory. Assume a key won't exceed 256 bytes thus MAX_KEY_SIZE=256 according to "cryptsetup benchmark". Link: https://lkml.kernel.org/r/20250502011246.99238-4-coxu@redhat.com Signed-off-by: Coiby Xu <coxu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Daniel P. Berrange" <berrange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Jan Pazdziora <jpazdziora@redhat.com> Cc: Liu Pingfan <kernelfans@gmail.com> Cc: Milan Broz <gmazyland@gmail.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21crash_dump: make dm crypt keys persist for the kdump kernelCoiby Xu
A configfs /sys/kernel/config/crash_dm_crypt_keys is provided for user space to make the dm crypt keys persist for the kdump kernel. Take the case of dumping to a LUKS-encrypted target as an example, here is the life cycle of the kdump copies of LUKS volume keys, 1. After the 1st kernel loads the initramfs during boot, systemd uses an user-input passphrase to de-crypt the LUKS volume keys or simply TPM-sealed volume keys and then save the volume keys to specified keyring (using the --link-vk-to-keyring API) and the keys will expire within specified time. 2. A user space tool (kdump initramfs loader like kdump-utils) create key items inside /sys/kernel/config/crash_dm_crypt_keys to inform the 1st kernel which keys are needed. 3. When the kdump initramfs is loaded by the kexec_file_load syscall, the 1st kernel will iterate created key items, save the keys to kdump reserved memory. 4. When the 1st kernel crashes and the kdump initramfs is booted, the kdump initramfs asks the kdump kernel to create a user key using the key stored in kdump reserved memory by writing yes to /sys/kernel/crash_dm_crypt_keys/restore. Then the LUKS encrypted device is unlocked with libcryptsetup's --volume-key-keyring API. 5. The system gets rebooted to the 1st kernel after dumping vmcore to the LUKS encrypted device is finished Eventually the keys have to stay in the kdump reserved memory for the kdump kernel to unlock encrypted volumes. During this process, some measures like letting the keys expire within specified time are desirable to reduce security risk. This patch assumes, 1) there are 128 LUKS devices at maximum to be unlocked thus MAX_KEY_NUM=128. 2) a key description won't exceed 128 bytes thus KEY_DESC_MAX_LEN=128. And here is a demo on how to interact with /sys/kernel/config/crash_dm_crypt_keys, # Add key #1 mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720 # Add key #1's description echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description # how many keys do we have now? cat /sys/kernel/config/crash_dm_crypt_keys/count 1 # Add key# 2 in the same way # how many keys do we have now? cat /sys/kernel/config/crash_dm_crypt_keys/count 2 # the tree structure of /crash_dm_crypt_keys configfs tree /sys/kernel/config/crash_dm_crypt_keys/ /sys/kernel/config/crash_dm_crypt_keys/ ├── 7d26b7b4-e342-4d2d-b660-7426b0996720 │   └── description ├── count ├── fce2cd38-4d59-4317-8ce2-1fd24d52c46a │   └── description Link: https://lkml.kernel.org/r/20250502011246.99238-3-coxu@redhat.com Signed-off-by: Coiby Xu <coxu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Daniel P. Berrange" <berrange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Jan Pazdziora <jpazdziora@redhat.com> Cc: Liu Pingfan <kernelfans@gmail.com> Cc: Milan Broz <gmazyland@gmail.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21kexec_file: allow to place kexec_buf randomlyCoiby Xu
Patch series "Support kdump with LUKS encryption by reusing LUKS volume keys", v9. LUKS is the standard for Linux disk encryption, widely adopted by users, and in some cases, such as Confidential VMs, it is a requirement. With kdump enabled, when the first kernel crashes, the system can boot into the kdump/crash kernel to dump the memory image (i.e., /proc/vmcore) to a specified target. However, there are two challenges when dumping vmcore to a LUKS-encrypted device: - Kdump kernel may not be able to decrypt the LUKS partition. For some machines, a system administrator may not have a chance to enter the password to decrypt the device in kdump initramfs after the 1st kernel crashes; For cloud confidential VMs, depending on the policy the kdump kernel may not be able to unseal the keys with TPM and the console virtual keyboard is untrusted. - LUKS2 by default use the memory-hard Argon2 key derivation function which is quite memory-consuming compared to the limited memory reserved for kdump. Take Fedora example, by default, only 256M is reserved for systems having memory between 4G-64G. With LUKS enabled, ~1300M needs to be reserved for kdump. Note if the memory reserved for kdump can't be used by 1st kernel i.e. an user sees ~1300M memory missing in the 1st kernel. Besides users (at least for Fedora) usually expect kdump to work out of the box i.e. no manual password input or custom crashkernel value is needed. And it doesn't make sense to derivate the keys again in kdump kernel which seems to be redundant work. This patchset addresses the above issues by making the LUKS volume keys persistent for kdump kernel with the help of cryptsetup's new APIs (--link-vk-to-keyring/--volume-key-keyring). Here is the life cycle of the kdump copies of LUKS volume keys, 1. After the 1st kernel loads the initramfs during boot, systemd use an user-input passphrase to de-crypt the LUKS volume keys or TPM-sealed key and then save the volume keys to specified keyring (using the --link-vk-to-keyring API) and the key will expire within specified time. 2. A user space tool (kdump initramfs loader like kdump-utils) create key items inside /sys/kernel/config/crash_dm_crypt_keys to inform the 1st kernel which keys are needed. 3. When the kdump initramfs is loaded by the kexec_file_load syscall, the 1st kernel will iterate created key items, save the keys to kdump reserved memory. 4. When the 1st kernel crashes and the kdump initramfs is booted, the kdump initramfs asks the kdump kernel to create a user key using the key stored in kdump reserved memory by writing yes to /sys/kernel/crash_dm_crypt_keys/restore. Then the LUKS encrypted device is unlocked with libcryptsetup's --volume-key-keyring API. 5. The system gets rebooted to the 1st kernel after dumping vmcore to the LUKS encrypted device is finished After libcryptsetup saving the LUKS volume keys to specified keyring, whoever takes this should be responsible for the safety of these copies of keys. The keys will be saved in the memory area exclusively reserved for kdump where even the 1st kernel has no direct access. And further more, two additional protections are added, - save the copy randomly in kdump reserved memory as suggested by Jan - clear the _PAGE_PRESENT flag of the page that stores the copy as suggested by Pingfan This patchset only supports x86. There will be patches to support other architectures once this patch set gets merged. This patch (of 9): Currently, kexec_buf is placed in order which means for the same machine, the info in the kexec_buf is always located at the same position each time the machine is booted. This may cause a risk for sensitive information like LUKS volume key. Now struct kexec_buf has a new field random which indicates it's supposed to be placed in a random position. Note this feature is enabled only when CONFIG_CRASH_DUMP is enabled. So it only takes effect for kdump and won't impact kexec reboot. Link: https://lkml.kernel.org/r/20250502011246.99238-1-coxu@redhat.com Link: https://lkml.kernel.org/r/20250502011246.99238-2-coxu@redhat.com Signed-off-by: Coiby Xu <coxu@redhat.com> Suggested-by: Jan Pazdziora <jpazdziora@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Daniel P. Berrange" <berrange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Liu Pingfan <kernelfans@gmail.com> Cc: Milan Broz <gmazyland@gmail.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-21sched_ext: idle: Consolidate default idle CPU selection kfuncsAndrea Righi
There is no reason to restrict scx_bpf_select_cpu_dfl() invocations to ops.select_cpu() while allowing scx_bpf_select_cpu_and() to be used from multiple contexts, as both provide equivalent functionality, with the latter simply accepting an additional "allowed" cpumask. Therefore, unify the two APIs, enabling both kfuncs to be used from ops.select_cpu(), ops.enqueue(), and unlocked contexts (e.g., via BPF test_run). This allows schedulers to implement a consistent idle CPU selection policy and helps reduce code duplication. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-21genirq/irqdesc: Remove double locking in hwirq_show()Claudiu Beznea
&desc->lock is acquired on 2 consecutive lines in hwirq_show(). This leads obviously to a deadlock. Drop the raw_spin_lock_irq() and keep guard(). Fixes: 5d964a9f7cd8 ("genirq/irqdesc: Switch to lock guards") Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250521142541.3832130-1-claudiu.beznea.uj@bp.renesas.com
2025-05-21perf: Only dump the throttle log for the leaderKan Liang
The PERF_RECORD_THROTTLE records are dumped for all throttled events. It's not necessary for group events, which are throttled altogether. Optimize it by only dump the throttle log for the leader. The sample right after the THROTTLE record must be generated by the actual target event. It is good enough for the perf tool to locate the actual target event. Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
2025-05-21perf: Fix the throttle logic for a groupKan Liang
The current throttle logic doesn't work well with a group, e.g., the following sampling-read case. $ perf record -e "{cycles,cycles}:S" ... $ perf report -D | grep THROTTLE | tail -2 THROTTLE events: 426 ( 9.0%) UNTHROTTLE events: 425 ( 9.0%) $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5 0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1): ... sample_read: .... group nr 2 ..... id 0000000000000327, value 000000000cbb993a, lost 0 ..... id 0000000000000328, value 00000002211c26df, lost 0 The second cycles event has a much larger value than the first cycles event in the same group. The current throttle logic in the generic code only logs the THROTTLE event. It relies on the specific driver implementation to disable events. For all ARCHs, the implementation is similar. Only the event is disabled, rather than the group. The logic to disable the group should be generic for all ARCHs. Add the logic in the generic code. The following patch will remove the buggy driver-specific implementation. The throttle only happens when an event is overflowed. Stop the entire group when any event in the group triggers the throttle. The MAX_INTERRUPTS is set to all throttle events. The unthrottled could happen in 3 places. - event/group sched. All events in the group are scheduled one by one. All of them will be unthrottled eventually. Nothing needs to be changed. - The perf_adjust_freq_unthr_events for each tick. Needs to restart the group altogether. - The __perf_event_period(). The whole group needs to be restarted altogether as well. With the fix, $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5 0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2): ... sample_read: .... group nr 2 ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0 ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0 Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
2025-05-21futex: Correct the kernedoc return value for futex_wait_setup().Sebastian Andrzej Siewior
The kerneldoc for futex_wait_setup() states it can return "0" or "<1". This isn't true because the error case is "<0" not less than 1. Document that <0 is returned on error. Drop the possible return values and state possible reasons. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20250517151455.1065363-6-bigeasy@linutronix.de
2025-05-21sched/uclamp: Align uclamp and util_est and call before freq updateXuewen Yan
The commit dfa0a574cbc47 ("sched/uclamg: Handle delayed dequeue") has add the sched_delayed check to prevent double uclamp_dec/inc. However, it put the uclamp_rq_inc() after enqueue_task(). This may lead to the following issues: When a task with uclamp goes through enqueue_task() and could trigger cpufreq update, its uclamp won't even be considered in the cpufreq update. It is only after enqueue will the uclamp be added to rq buckets, and cpufreq will only pick it up at the next update. This could cause a delay in frequency updating. It may affect the performance(uclamp_min > 0) or power(uclamp_max < 1024). So, just like util_est, put the uclamp_rq_inc() before enqueue_task(). And as for the sched_delayed_task, same as util_est, using the sched_delayed flag to prevent inc the sched_delayed_task's uclamp, using the ENQUEUE_DELAYED flag to allow inc the sched_delayed_task's uclamp which is being woken up. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250417043457.10632-3-xuewen.yan@unisoc.com
2025-05-21sched/util_est: Simplify condition for util_est_{en,de}queue()Xuewen Yan
To prevent double enqueue/dequeue of the util-est for sched_delayed tasks, commit 729288bc6856 ("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE") added the corresponding check. This check excludes double en/dequeue during task migration and priority changes. In fact, these conditions can be simplified. For util_est_dequeue, we know that sched_delayed flag is set in dequeue_entity. When the task is sleeping, we need to call util_est_dequeue to subtract util-est from the cfs_rq. At this point, sched_delayed has not yet been set. If we find that sched_delayed is already set, it indicates that this task has already called dequeue_task_fair once. In this case, there is no need to call util_est_dequeue again. Therefore, simply checking the sched_delayed flag should be sufficient to prevent unnecessary util_est updates during the dequeue. For util_est_enqueue, our goal is to add the util_est to the cfs_rq when task enqueue. However, we don't want to add the util_est of a sched_delayed task to the cfs_rq because the task is sleeping. Therefore, we can exclude the util_est_enqueue for sched_delayed tasks by checking the sched_delayed flag. However, when waking up a delayed task, the sched_delayed flag is cleared after util_est_enqueue. As a result, if we only check the sched_delayed flag, we would miss the util_est_enqueue. Since waking up a sched_delayed task calls enqueue_task with the ENQUEUE_DELAYED flag, we can determine whether to call util_est_enqueue by checking if the enqueue_flag contains ENQUEUE_DELAYED. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250417043457.10632-2-xuewen.yan@unisoc.com
2025-05-21sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUEXuewen Yan
Delayed dequeued feature keeps a sleeping task enqueued until its lag has elapsed. As a result, it stays also visible in rq->nr_running. So when in wake_affine_idle(), we should use the real running-tasks in rq to check whether we should place the wake-up task to current cpu. On the other hand, add a helper function to return the nr-delayed. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
2025-05-20Merge tag 'v6.15-p7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes a regression in padata as well as an ancient double-free bug in af_alg" * tag 'v6.15-p7' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: algif_hash - fix double free in hash_accept padata: do not leak refcount in reorder_work
2025-05-20sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked contextAndrea Righi
Allow scx_bpf_select_cpu_and() to be used from an unlocked context, in addition to ops.enqueue() or ops.select_cpu(). This enables schedulers, including user-space ones, to implement a consistent idle CPU selection policy and helps reduce code duplication. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()Andrea Righi
Validate locking correctness when accessing p->nr_cpus_allowed and p->cpus_ptr inside scx_bpf_select_cpu_and(): if the rq lock is held, access is safe; otherwise, require that p->pi_lock is held. This allows to catch potential unsafe calls to scx_bpf_select_cpu_and(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.cAndrea Righi
Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other source files (e.g., ext_idle.c). No functional change. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: helper for checking rstat participation of cssJP Kobryn
There are a few places where a conditional check is performed to validate a given css on its rstat participation. This new helper tries to make the code more readable where this check is performed. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: use subsystem-specific rstat locks to avoid contentionJP Kobryn
It is possible to eliminate contention between subsystems when updating/flushing stats by using subsystem-specific locks. Let the existing rstat locks be dedicated to the cgroup base stats and rename them to reflect that. Add similar locks to the cgroup_subsys struct for use with individual subsystems. Lock initialization is done in the new function ss_rstat_init(ss) which replaces cgroup_rstat_boot(void). If NULL is passed to this function, the global base stat locks will be initialized. Otherwise, the subsystem locks will be initialized. Change the existing lock helper functions to accept a reference to a css. Then within these functions, conditionally select the appropriate locks based on the subsystem affiliation of the given css. Add helper functions for this selection routine to avoid repeated code. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: use separate rstat trees for each subsystemJP Kobryn
Different subsystems may call cgroup_rstat_updated() within the same cgroup, resulting in a tree of pending updates from multiple subsystems. When one of these subsystems is flushed via cgroup_rstat_flushed(), all other subsystems with pending updates on the tree will also be flushed. Change the paradigm of having a single rstat tree for all subsystems to having separate trees for each subsystem. This separation allows for subsystems to perform flushes without the side effects of other subsystems. As an example, flushing the cpu stats will no longer cause the memory stats to be flushed and vice versa. In order to achieve subsystem-specific trees, change the tree node type from cgroup to cgroup_subsys_state pointer. Then remove those pointers from the cgroup and instead place them on the css. Finally, change update/flush functions to make use of the different node type (css). These changes allow a specific subsystem to be associated with an update or flush. Separate rstat trees will now exist for each unique subsystem. Since updating/flushing will now be done at the subsystem level, there is no longer a need to keep track of updated css nodes at the cgroup level. The list management of these nodes done within the cgroup (rstat_css_list and related) has been removed accordingly. Conditional guards for checking validity of a given css were placed within css_rstat_updated/flush() to prevent undefined behavior occuring from kfunc usage in bpf programs. Guards were also placed within css_rstat_init/exit() in order to help consolidate calls to them. At call sites for all four functions, the existing guards were removed. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: compare css to cgroup::self in helper for distingushing cssJP Kobryn
Adjust the implementation of css_is_cgroup() so that it compares the given css to cgroup::self. Rename the function to css_is_self() in order to reflect that. Change the existing css->ss NULL check to a warning in the true branch. Finally, adjust call sites to use the new function name. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: warn on rstat usage by early init subsystemsJP Kobryn
An early init subsystem that attempts to make use of rstat can lead to failures during early boot. The reason for this is the timing in which the css's of the root cgroup have css_online() invoked on them. At the point of this call, there is a stated assumption that a cgroup has "successfully completed all allocations" [0]. An example of a subsystem that relies on the previously mentioned assumption [0] is the memory subsystem. Within its implementation of css_online(), work is queued to asynchronously begin flushing via rstat. In the early init path for a given subsystem, having rstat enabled leads to this sequence: cgroup_init_early() for_each_subsys(ss, ssid) if (ss->early_init) cgroup_init_subsys(ss, true) cgroup_init_subsys(ss, early_init) css = ss->css_alloc(...) init_and_link_css(css, ss, ...) ... online_css(css) online_css(css) ss = css->ss ss->css_online(css) Continuing to use the memory subsystem as an example, the issue with this sequence is that css_rstat_init() has not been called yet. This means there is now a race between the pending async work to flush rstat and the call to css_rstat_init(). So a flush can occur within the given cgroup while the rstat fields are not initialized. Since we are in the early init phase, the rstat fields cannot be initialized because they require per-cpu allocations. So it's not possible to have css_rstat_init() called early enough (before online_css()). This patch treats the combination of early init and rstat the same as as other invalid conditions. [0] Documentation/admin-guide/cgroup-v1/cgroups.rst (section: css_online) Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19bpf: WARN_ONCE on verifier bugsPaul Chaignon
Throughout the verifier's logic, there are multiple checks for inconsistent states that should never happen and would indicate a verifier bug. These bugs are typically logged in the verifier logs and sometimes preceded by a WARN_ONCE. This patch reworks these checks to consistently emit a verifier log AND a warning when CONFIG_DEBUG_KERNEL is enabled. The consistent use of WARN_ONCE should help fuzzers (ex. syzkaller) expose any situation where they are actually able to reach one of those buggy verifier states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/aCs1nYvNNMq8dAWP@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-19padata: do not leak refcount in reorder_workDominik Grzegorzek
A recent patch that addressed a UAF introduced a reference count leak: the parallel_data refcount is incremented unconditionally, regardless of the return value of queue_work(). If the work item is already queued, the incremented refcount is never decremented. Fix this by checking the return value of queue_work() and decrementing the refcount when necessary. Resolves: Unreferenced object 0xffff9d9f421e3d80 (size 192): comm "cryptomgr_probe", pid 157, jiffies 4294694003 hex dump (first 32 bytes): 80 8b cf 41 9f 9d ff ff b8 97 e0 89 ff ff ff ff ...A............ d0 97 e0 89 ff ff ff ff 19 00 00 00 1f 88 23 00 ..............#. backtrace (crc 838fb36): __kmalloc_cache_noprof+0x284/0x320 padata_alloc_pd+0x20/0x1e0 padata_alloc_shell+0x3b/0xa0 0xffffffffc040a54d cryptomgr_probe+0x43/0xc0 kthread+0xf6/0x1f0 ret_from_fork+0x2f/0x50 ret_from_fork_asm+0x1a/0x30 Fixes: dd7d37ccf6b1 ("padata: avoid UAF for reorder_work") Cc: <stable@vger.kernel.org> Signed-off-by: Dominik Grzegorzek <dominik.grzegorzek@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-18module: Remove outdated comment about text_sizeValentin Schneider
The text_size bit referred to by the comment has been removed as of commit ac3b43283923 ("module: replace module_layout with module_memory") and is thus no longer relevant. Remove it and comment about the contents of the masks array instead. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20250429113242.998312-23-vschneid@redhat.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-05-18module: Make .static_call_sites read-only after initPetr Pavlu
Section .static_call_sites holds data structures that need to be sorted and processed only at module load time. This initial processing happens in static_call_add_module(), which is invoked as a callback to the MODULE_STATE_COMING notification from prepare_coming_module(). The section is never modified afterwards. Make it therefore read-only after module initialization to avoid any (non-)accidental modifications. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250306131430.7016-4-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-05-18module: Add a separate function to mark sections as read-only after initPetr Pavlu
Move the logic to mark special sections as read-only after module initialization into a separate function, along other related code in strict_rwx.c. Use a table with names of such sections to make it easier to add more. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250306131430.7016-3-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-05-18module: Constify parameters of module_enforce_rwx_sections()Petr Pavlu
Minor cleanup, this is a non-functional change. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250306131430.7016-2-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-05-17Merge tag 'mm-hotfixes-stable-2025-05-17-09-41' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Nine singleton hotfixes, all MM. Four are cc:stable" * tag 'mm-hotfixes-stable-2025-05-17-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: userfaultfd: correct dirty flags set for both present and swap pte zsmalloc: don't underflow size calculation in zs_obj_write() mm/page_alloc: fix race condition in unaccepted memory handling mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory MAINTAINERS: add mm GUP section mm/codetag: move tag retrieval back upfront in __free_pages() mm/memory: fix mapcount / refcount sanity check for mTHP reuse kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork() mm: hugetlb: fix incorrect fallback for subpool
2025-05-17perf/core: Add the is_event_in_freq_mode() helper to simplify the codeKan Liang
Add a helper to check if an event is in freq mode to improve readability. No functional changes. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250516182853.2610284-2-kan.liang@linux.intel.com
2025-05-16PM: freezer: Rewrite restarting tasks log to remove stray *done.*Paul Menzel
`pr_cont()` unfortunately does not work here, as other parts of the Linux kernel log between the two log lines: [18445.295056] r8152-cfgselector 4-1.1.3: USB disconnect, device number 5 [18445.295112] OOM killer enabled. [18445.295115] Restarting tasks ... [18445.295185] usb 3-1: USB disconnect, device number 2 [18445.295193] usb 3-1.1: USB disconnect, device number 3 [18445.296262] usb 3-1.5: USB disconnect, device number 4 [18445.297017] done. [18445.297029] random: crng reseeded on system resumption `pr_cont()` also uses the default log level, normally warning, if the corresponding log line is interrupted. Therefore, replace the `pr_cont()`, and explicitly log it as a separate line with log level info: Restarting tasks: Starting […] Restarting tasks: Done Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://patch.msgid.link/20250511174648.950430-1-pmenzel@molgen.mpg.de [ rjw: Rebase on top of an earlier analogous change ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-16genirq/msi: Add helper for creating MSI-parent irq domainsMarc Zyngier
Creating an irq domain that serves as an MSI parent requires a substantial amount of esoteric boiler-plate code, some of which is often provided twice (such as the bus token). To make things a bit simpler for the unsuspecting MSI tinkerer, provide a helper that does it for them, and serves as documentation of what needs to be provided. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250513172819.2216709-3-maz@kernel.org
2025-05-16irqdomain: Fix kernel-doc and add it to DocumentationJiri Slaby (SUSE)
irqdomain.c's kernel-doc exists, but is not plugged into Documentation/ yet. Before plugging it in, fix it first: irq_domain_get_irq_data() and irq_domain_set_info() were documented twice. Identically, by both definitions for CONFIG_IRQ_DOMAIN_HIERARCHY and !CONFIG_IRQ_DOMAIN_HIERARCHY. Therefore, switch the second kernel-doc into an ordinary comment -- change "/**" to simple "/*". This avoids sphinx's: WARNING: Duplicate C declaration Next, in commit b7b377332b96 ("irqdomain: Fix the kernel-doc and plug it into Documentation"), irqdomain.h's (header) kernel-doc was added into core-api/genericirq.rst. But given the amount of irqdomain functions and structures, move all these to core-api/irq/irq-domain.rst now. Finally, add these newly fixed irqdomain.c's (source) docs there as well. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-58-jirislaby@kernel.org
2025-05-16irqdomain: Drop irq_domain_add_*() functionsJiri Slaby (SUSE)
Most irq_domain_add_*() functions are unused now, so drop them. The remaining ones are moved to the deprecated section and will be removed during the merge window after the patches in various trees have been merged. Note: The Chinese docs are touched but unfinished. I cannot parse those. [ tglx: Remove the leftover in irq-domain.rst and handle merge logistics ] Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-41-jirislaby@kernel.org