Age  Commit message  Author
2025-05-27selftests/bpf: Add unit tests with __bpf_trap() kfuncYonghong Song
Add some inline-asm tests and C tests where __bpf_trap() or __builtin_trap() is used in the code. The __builtin_trap() test is guarded with llvm21 ([1]), since compilation would fail otherwise. [1] https://github.com/llvm/llvm-project/pull/131731 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250523205331.1291734-1-yonghong.song@linux.dev Tested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27Merge tag 'x86_sev_for_v6.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull AMD SEV update from Borislav Petkov: "Add a virtual TPM driver glue which allows a guest kernel to talk to a TPM device emulated by a Secure VM Service Module (SVSM) - a helper module of sorts which runs at a different privilege level in the SEV-SNP VM stack. The intent being that a TPM device is emulated by a trusted entity and not by the untrusted host which is the default assumption in the confidential computing scenarios" * tag 'x86_sev_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Register tpm-svsm platform device tpm: Add SNP SVSM vTPM driver svsm: Add header with SVSM_VTPM_CMD helpers x86/sev: Add SVSM vTPM probe/send_command functions
2025-05-27Merge tag 'x86_mtrr_for_v6.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull mtrr update from Borislav Petkov: "A single change to verify the presence of fixed MTRR ranges before accessing the respective MSRs" * tag 'x86_mtrr_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mtrr: Check if fixed-range MTRRs exist in mtrr_save_fixed_ranges()
2025-05-27Merge tag 'edac_updates_for_v6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - ie31200: Add support for Raptor Lake-S and Alder Lake-S compute dies - Rework how RRL registers per channel tracking is done in order to support newer hardware with different RRL configurations and refactor that code. Add support for Granite Rapids server - i10nm: explicitly set RRL modes to fix any wrong BIOS programming - Properly save and restore Retry Read error Log channel configuration info on Intel drivers - igen6: Handle correctly the case of fused off memory controllers on Arizona Beach and Amston Lake SoCs before adding support for them - the usual set of fixes and cleanups * tag 'edac_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/bluefield: Don't use bluefield_edac_readl() result on error EDAC/i10nm: Fix the bitwise operation between variables of different sizes EDAC/ie31200: Add two Intel SoCs for EDAC support EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids server EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log() EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log() EDAC/{skx_common,i10nm}: Structure the per-channel RRL registers EDAC/i10nm: Explicitly set the modes of the RRL register sets EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0 EDAC/skx_common: Fix general protection fault EDAC/igen6: Add Intel Amston Lake SoCs support EDAC/igen6: Add Intel Arizona Beach SoCs support EDAC/igen6: Skip absent memory controllers
2025-05-27bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variableYonghong Song
Marc Suñé (Isovalent, part of Cisco) reported an issue where an uninitialized variable caused the generated bpf prog binary code to not work as expected. The reproducer is in [1], where the flags “-Wall -Werror” are enabled, but there is no warning because the compiler takes advantage of the uninitialized variable to do aggressive optimization. The optimized code looks like below:

    ; {
           0:       bf 16 00 00 00 00 00 00 r6 = r1
    ; bpf_printk("Start");
           1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
                    0000000000000008:  R_BPF_64_64  .rodata
           3:       b4 02 00 00 06 00 00 00 w2 = 0x6
           4:       85 00 00 00 06 00 00 00 call 0x6
    ; DEFINE_FUNC_CTX_POINTER(data)
           5:       61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
    ; bpf_printk("pre ipv6_hdrlen_offset");
           6:       18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
                    0000000000000030:  R_BPF_64_64  .rodata
           8:       b4 02 00 00 17 00 00 00 w2 = 0x17
           9:       85 00 00 00 06 00 00 00 call 0x6
    <END>

The verifier will report the following failure:

    9: (85) call bpf_trace_printk#6
    last insn is not an exit or jmp

The above verifier log does not give a clear hint about how to fix the problem, and the user may take quite some time to figure out that the issue is due to the compiler taking advantage of an uninitialized variable.

In llvm internals, uninitialized variable usage may generate 'unreachable' IR insns, and these 'unreachable' IR insns may indicate the impact of an uninitialized variable on code optimization. So far, the llvm BPF backend ignores 'unreachable' IR, hence the above code is generated. With clang21 patch [2], those 'unreachable' IR insns are converted to func __bpf_trap(). In order to maintain a proper control flow graph for bpf progs, [2] also adds an 'exit' insn after __bpf_trap() if __bpf_trap() is the last insn in the function.

The new code looks like:

    ; {
           0:       bf 16 00 00 00 00 00 00 r6 = r1
    ; bpf_printk("Start");
           1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
                    0000000000000008:  R_BPF_64_64  .rodata
           3:       b4 02 00 00 06 00 00 00 w2 = 0x6
           4:       85 00 00 00 06 00 00 00 call 0x6
    ; DEFINE_FUNC_CTX_POINTER(data)
           5:       61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
    ; bpf_printk("pre ipv6_hdrlen_offset");
           6:       18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
                    0000000000000030:  R_BPF_64_64  .rodata
           8:       b4 02 00 00 17 00 00 00 w2 = 0x17
           9:       85 00 00 00 06 00 00 00 call 0x6
          10:       85 10 00 00 ff ff ff ff call -0x1
                    0000000000000050:  R_BPF_64_32  __bpf_trap
          11:       95 00 00 00 00 00 00 00 exit
    <END>

In the kernel, a new kfunc __bpf_trap() is added. During insn verification, any hit with __bpf_trap() will result in verification failure. The kernel is then able to provide a better log message for debugging.

With llvm patch [2] and without this patch (no __bpf_trap() kfunc for existing kernel), e.g., for old kernels, the verifier outputs

    10: <invalid kfunc call>
    kfunc '__bpf_trap' is referenced but wasn't resolved

Basically, the kernel does not support the __bpf_trap() kfunc. This still does not give a clear signal about the possible reason.

With llvm patch [2] and with this patch, the verifier outputs

    10: (85) call __bpf_trap#74479
    unexpected __bpf_trap() due to uninitialized variable?

It gives much better hints for verification failure.

  [1] https://github.com/msune/clang_bpf/blob/main/Makefile#L3
  [2] https://github.com/llvm/llvm-project/pull/131731

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250523205326.1291640-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
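The difference between the two diagnostics can be illustrated with a small user-space model. This is only a sketch: the toy_* names and the opcode enum are hypothetical, and the two-pass order (control flow first, then per-insn checks) is a simplification of what the commit message describes, not the kernel verifier itself.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcodes for illustration only; this is NOT the real BPF encoding. */
enum toy_op { TOY_MOV, TOY_LOAD, TOY_CALL_HELPER, TOY_CALL_TRAP, TOY_EXIT, TOY_JMP };

enum toy_verdict { TOY_OK, TOY_ERR_NO_EXIT, TOY_ERR_TRAP };

/* Hypothetical two-pass check loosely mirroring the order described above. */
static enum toy_verdict toy_check(const enum toy_op *insns, size_t n)
{
	size_t i;

	assert(insns != NULL && n > 0);

	/* Pass 1: control flow. The program must end in an exit or a jump. */
	if (insns[n - 1] != TOY_EXIT && insns[n - 1] != TOY_JMP)
		return TOY_ERR_NO_EXIT;

	/* Pass 2: per-insn checks. Any call to the trap is rejected. */
	for (i = 0; i < n; i++)
		if (insns[i] == TOY_CALL_TRAP)
			return TOY_ERR_TRAP;

	return TOY_OK;
}

/* Messages copied from the verifier output quoted in the commit message. */
static const char *toy_verdict_str(enum toy_verdict v)
{
	switch (v) {
	case TOY_ERR_NO_EXIT:
		return "last insn is not an exit or jmp";
	case TOY_ERR_TRAP:
		return "unexpected __bpf_trap() due to uninitialized variable?";
	default:
		return "ok";
	}
}
```

In this model, a program that simply stops after the last helper call gets the generic control-flow error, while the same program followed by a trap call and an exit gets the targeted hint.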
2025-05-27Merge tag 'x86_cache_for_v6.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: "Carve out the resctrl filesystem-related code into fs/resctrl/ so that multiple architectures can share the fs API for manipulating their respective hw resource control implementation. This is the second step in the work towards sharing the resctrl filesystem interface, the next one being plugging ARM's MPAM into the aforementioned fs API" * tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) MAINTAINERS: Add reviewers for fs/resctrl x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl x86/resctrl: Always initialise rid field in rdt_resources_all[] x86/resctrl: Relax some asm #includes x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context() x86/resctrl: Squelch whitespace anomalies in resctrl core code x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs x86/resctrl: Move enum resctrl_event_id to resctrl.h x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl fs/resctrl: Add boiler plate for external resctrl code x86/resctrl: Add 'resctrl' to the title of the resctrl documentation x86/resctrl: Split trace.h x86/resctrl: Expand the width of domid by replacing mon_data_bits x86/resctrl: Add end-marker to the resctrl_event_id enum x86/resctrl: Move is_mba_sc() out of core.c x86/resctrl: Drop __init/__exit on assorted symbols x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point x86/resctrl: Check all domains are offline in resctrl_exit() x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_" ...
2025-05-27bpf: Remove special_kfunc_set from verifierYonghong Song
Currently, the verifier has both special_kfunc_set and special_kfunc_list. When adding a new kfunc usage to the verifier, it is often unclear whether that kfunc should be added to special_kfunc_set, special_kfunc_list, or both. For example, some kfuncs, e.g., bpf_dynptr_from_skb, bpf_dynptr_clone, bpf_wq_set_callback_impl, do not need to be in special_kfunc_set. To avoid potential future confusion, special_kfunc_set is deleted and btf_id_set_contains(&special_kfunc_set, ...) is removed. The code is refactored with a new func check_special_kfunc(), which contains all the code covered by the original branch meta.btf == btf_vmlinux && btf_id_set_contains(&special_kfunc_set, meta.func_id) There is no functionality change. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250523205321.1291431-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27Merge branch 'replace-config_dmabuf_sysfs_stats-with-bpf'Alexei Starovoitov
T.J. Mercier says: ==================== Replace CONFIG_DMABUF_SYSFS_STATS with BPF Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to perform per-buffer accounting with debugfs which is not suitable for production environments. Eventually we discovered the overhead with per-buffer sysfs file creation/removal was significantly impacting allocation and free times, and exacerbated kernfs lock contention. [2] dma_buf_stats_setup() is responsible for 39% of single-page buffer creation duration, or 74% of single-page dma_buf_export() duration when stressing dmabuf allocations and frees. I prototyped a change from per-buffer to per-exporter statistics with a RCU protected list of exporter allocations that accommodates most (but not all) of our use-cases and avoids almost all of the sysfs overhead. While that adds less overhead than per-buffer sysfs, and less even than the maintenance of the dmabuf debugfs_list, it's still *additional* overhead on top of the debugfs_list and doesn't give us per-buffer info. This series uses the existing dmabuf debugfs_list to implement a BPF dmabuf iterator, which adds no overhead to buffer allocation/free and provides per-buffer info. The list has been moved outside of CONFIG_DEBUG_FS scope so that it is always populated. The BPF program loaded by userspace that extracts per-buffer information gets to define its own interface which avoids the lack of ABI stability with debugfs. This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and the plan is to remove it from the kernel after the next longterm stable release. 
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.com [2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com v1 -> v2: Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian König Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel test robot Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in selftest per Song Liu Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei Starovoitov Add open-coded iterator and selftest per Alexei Starovoitov Add a second test buffer from the system dmabuf heap to selftests Use the BPF program we'll use in production for selftest per Alexei Starovoitov https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.c https://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdmabufinfo/dmabuf_bpf_stats.cpp v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com v2 -> v3: Rebase onto bpf-next/master Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the new get_first_dmabuf(). This avoids having to expose the dmabuf list and mutex to the rest of the kernel, and keeps the dmabuf mutex operations near each other in the same file. 
(Christian König) Add Christian's RB to dma-buf: Rename debugfs symbols Drop RFC: dma-buf: Remove DMA-BUF statistics v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com v3 -> v4: Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov Fix dma-buf.c kdoc comment style per Alexei Starovoitov Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin / dma_buf_iter_next per Christian König Add Christian's RB to bpf: Add dmabuf iterator v4: https://lore.kernel.org/all/20250508182025.2961555-1-tjmercier@google.com v4 -> v5: Add Christian's Acks to all patches Add Song Liu's Acks Move BTF_ID_LIST_SINGLE and DEFINE_BPF_ITER_FUNC closer to usage per Song Liu Fix open-coded iterator comment style per Song Liu Move iterator termination check to its own subtest per Song Liu Rework selftest buffer creation per Song Liu Fix spacing in sanitize_string per BPF CI v5: https://lore.kernel.org/all/20250512174036.266796-1-tjmercier@google.com v5 -> v6: Song Liu: Init test buffer FDs to -1 Zero-init udmabuf_create for future proofing Bail early for iterator fd/FILE creation failure Dereference char ptr to check for NUL in sanitize_string() Move map insertion from create_test_buffers() to test_dmabuf_iter() Add ACK to selftests/bpf: Add test for open coded dmabuf_iter v6: https://lore.kernel.org/all/20250513163601.812317-1-tjmercier@google.com v6 -> v7: Zero uninitialized name bytes following the end of name strings per s390x BPF CI Reorder sanitize_string bounds checks per Song Liu Add Song's Ack to: selftests/bpf: Add test for dmabuf_iter Rebase onto bpf-next/master per BPF CI ==================== Link: https://patch.msgid.link/20250522230429.941193-1-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27selftests/bpf: Add test for open coded dmabuf_iterT.J. Mercier
Use the same test buffers as the traditional iterator and a new BPF map to verify the test buffers can be found with the open coded dmabuf iterator. Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250522230429.941193-6-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27selftests/bpf: Add test for dmabuf_iterT.J. Mercier
This test creates a udmabuf, and a dmabuf from the system dmabuf heap, and uses a BPF program that prints dmabuf metadata with the new dmabuf_iter to verify they can be found. Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250522230429.941193-5-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27bpf: Add open coded dmabuf iteratorT.J. Mercier
This open coded iterator allows for more flexibility when creating BPF programs. It can support output in formats other than text. With an open coded iterator, a single BPF program can traverse multiple kernel data structures (now including dmabufs), allowing for more efficient analysis of kernel data compared to multiple reads from procfs, sysfs, or multiple traditional BPF iterator invocations. Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250522230429.941193-4-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27bpf: Add dmabuf iteratorT.J. Mercier
The dmabuf iterator traverses the list of all DMA buffers. DMA buffers are refcounted through their associated struct file. A reference is taken on each buffer as the list is iterated to ensure each buffer persists for the duration of the bpf program execution without holding the list mutex. Signed-off-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250522230429.941193-3-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
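The pin-while-iterating pattern described above can be sketched in user space. Everything here is hypothetical (the toy_* names, the boolean standing in for the list mutex, and the choice to skip entries whose refcount is already zero); it is a single-threaded model of the idea, not the code in drivers/dma-buf/dma-buf.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_buf {
	struct toy_buf *next;
	int refcount;		/* 0 means "being freed"; assumed to be skipped */
	int id;
};

struct toy_buf_list {
	struct toy_buf *head;
	bool locked;		/* stand-in for the list mutex */
};

static void toy_list_lock(struct toy_buf_list *l)
{
	assert(!l->locked);
	l->locked = true;
}

static void toy_list_unlock(struct toy_buf_list *l)
{
	assert(l->locked);
	l->locked = false;
}

static void toy_buf_put(struct toy_buf *b)
{
	assert(b->refcount > 0);
	b->refcount--;
}

/*
 * Return the next live buffer after prev (or the first one if prev is NULL)
 * with a reference held, then drop the reference on prev. The list lock is
 * held only while walking, never while the caller uses the buffer.
 */
static struct toy_buf *toy_iter_next(struct toy_buf_list *l, struct toy_buf *prev)
{
	struct toy_buf *b;

	toy_list_lock(l);
	for (b = prev ? prev->next : l->head; b; b = b->next) {
		if (b->refcount > 0) {
			b->refcount++;	/* pin it for the caller */
			break;
		}
	}
	toy_list_unlock(l);

	if (prev)
		toy_buf_put(prev);
	return b;
}

/* Example consumer: sums the ids of all live buffers. */
static int toy_sum_ids(struct toy_buf_list *l)
{
	struct toy_buf *b;
	int sum = 0;

	for (b = toy_iter_next(l, NULL); b; b = toy_iter_next(l, b)) {
		assert(!l->locked);	/* the "program" runs without the list lock */
		sum += b->id;
	}
	return sum;
}
```

Because the previous buffer stays pinned until the next one has been found, its next pointer remains usable when the lock is re-taken, which is what lets the lock be dropped between steps in this model.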
2025-05-27dma-buf: Rename debugfs symbolsT.J. Mercier
Rename the debugfs list and mutex so it's clear they are now usable without the need for CONFIG_DEBUG_FS. The list will always be populated to support the creation of a BPF iterator for dmabufs. Signed-off-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250522230429.941193-2-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27ASoC: codecs: wcd93xx: Few regulator supplies fixesMark Brown
Merge series from Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>: Fix cleanup paths in wcd9335 and wcd937x codec drivers.
2025-05-27RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUsMaxim Levitsky
Use kvm_trylock_all_vcpus instead of a custom implementation when locking all vCPUs of a VM. Compile tested only. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Message-ID: <20250512180407.659015-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUsMaxim Levitsky
Use kvm_trylock_all_vcpus instead of a custom implementation when locking all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.

This fixes the following false lockdep warning:

[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementationMaxim Levitsky
Use kvm_lock_all_vcpus instead of sev's own implementation. Because kvm_lock_all_vcpus uses the _nest_lock feature of lockdep, which ignores subclasses, there is no longer a need to use separate subclasses for source and target VMs. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27KVM: add kvm_lock_all_vcpus and kvm_trylock_all_vcpusMaxim Levitsky
In a few cases, usually in the initialization code, KVM locks all vCPUs of a VM to ensure that userspace doesn't do funny things while KVM performs an operation that affects the whole VM. Until now, all these operations were implemented using custom code, and all of them share the same problem: lockdep can't cope with simultaneous locking of a large number of locks of the same class. However, if these locks are taken while another lock is already held, which is luckily the case, it is possible to take advantage of the little-known _nest_lock feature of lockdep, which in this case allows an unlimited number of locks of the same class to be taken. To implement this, create two functions: kvm_lock_all_vcpus() and kvm_trylock_all_vcpus(). Both functions are needed because some of the code that will be replaced in the subsequent patches uses mutex_trylock instead of regular mutex_lock. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
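The general shape of an all-or-nothing "trylock everything" helper can be sketched outside the kernel as follows. The toy_* names and the atomic_flag based lock are assumptions made for illustration; the sketch shows only the acquire-in-order and roll-back-on-failure pattern and says nothing about lockdep annotations or the actual KVM code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal try-lock built on a C11 atomic flag. */
struct toy_mutex {
	atomic_flag held;
};

static void toy_mutex_init(struct toy_mutex *m)
{
	atomic_flag_clear(&m->held);
}

static bool toy_mutex_trylock(struct toy_mutex *m)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set_explicit(&m->held, memory_order_acquire);
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	atomic_flag_clear_explicit(&m->held, memory_order_release);
}

/* Release the first n locks in reverse acquisition order. */
static void toy_unlock_all(struct toy_mutex *locks, size_t n)
{
	while (n--)
		toy_mutex_unlock(&locks[n]);
}

/*
 * Try to take every lock in order. On the first failure, release the ones
 * already taken and report -EBUSY, so the caller never ends up holding a
 * partial set.
 */
static int toy_trylock_all(struct toy_mutex *locks, size_t n)
{
	size_t i;

	assert(locks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!toy_mutex_trylock(&locks[i])) {
			toy_unlock_all(locks, i);
			return -EBUSY;
		}
	}
	return 0;
}
```

A blocking variant would follow the same structure with a lock call that can wait (and possibly be interrupted) in place of the try-lock.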
2025-05-27locking/mutex: implement mutex_lock_killable_nest_lockMaxim Levitsky
KVM's SEV intra-host migration code needs to lock all vCPUs of the source and the target VM, before it proceeds with the migration. The number of vCPUs that belong to each VM is not bounded by anything except a self-imposed KVM limit of CONFIG_KVM_MAX_NR_VCPUS vCPUs which is significantly larger than the depth of lockdep's lock stack. Luckily, the locks in both of the cases mentioned above, are held under the 'kvm->lock' of each VM, which means that we can use the little known lockdep feature called a "nest_lock" to support this use case in a cleaner way, compared to the way it's currently done. Implement and expose 'mutex_lock_killable_nest_lock' for this purpose. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27locking/mutex: implement mutex_trylock_nestedMaxim Levitsky
Despite the fact that several lockdep-related checks are skipped when calling trylock* versions of the locking primitives, for example mutex_trylock, each time the mutex is acquired, a held_lock is still placed onto the lockdep stack by __lock_acquire(), which is called regardless of whether the trylock* or regular locking API was used.

This means that if the caller successfully acquires more than MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock, lockdep will still complain that the maximum depth of the held lock stack has been reached and disable itself.

For example, the following error currently occurs in the ARM version of KVM, once the code tries to lock all vCPUs of a VM configured with more than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern systems, where having more than 48 CPUs is common, and it's also common to run VMs that have vCPU counts approaching that number:

[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Luckily, in all instances that require locking all vCPUs, the 'kvm->lock' is taken a priori, and that fact makes it possible to use the little-known feature of lockdep, called a 'nest_lock', to avoid this warning and subsequent lockdep self-disablement.

Providing a 'nest_lock' to lockdep's lock_acquire() causes lockdep to detect that the top of the held lock stack contains a lock of the same class and then increment its reference counter instead of pushing a new held_lock item onto that stack.

See __lock_acquire for more information.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
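The effect of the nest_lock on the held lock stack can be modelled with a few lines of user-space C. This is a toy model under stated assumptions: the toy_* names are hypothetical, the depth limit of 48 is taken from the log above, and the reference counting is simplified (it starts at 1 per entry), so it should not be read as lockdep's actual bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LOCK_DEPTH 48	/* matches "max: 48" in the log above */

enum toy_class { TOY_CLASS_KVM_LOCK, TOY_CLASS_VCPU_MUTEX };

struct toy_held_lock {
	enum toy_class class_id;
	unsigned int references;	/* simplified: 1 per merged acquisition */
};

struct toy_task {
	struct toy_held_lock held[TOY_MAX_LOCK_DEPTH];
	size_t depth;
};

/* Returns false where lockdep would print "MAX_LOCK_DEPTH too low!". */
static bool toy_lock_acquire(struct toy_task *t, enum toy_class class_id,
			     bool has_nest_lock)
{
	/* With a nest lock, a same-class entry on top is reused. */
	if (has_nest_lock && t->depth > 0 &&
	    t->held[t->depth - 1].class_id == class_id) {
		t->held[t->depth - 1].references++;
		return true;
	}

	if (t->depth >= TOY_MAX_LOCK_DEPTH)
		return false;

	t->held[t->depth].class_id = class_id;
	t->held[t->depth].references = 1;
	t->depth++;
	return true;
}

/* Take the outer lock, then up to nr_vcpus per-vCPU locks; return how many succeeded. */
static size_t toy_lock_vcpus(struct toy_task *t, size_t nr_vcpus, bool use_nest_lock)
{
	size_t i;

	if (!toy_lock_acquire(t, TOY_CLASS_KVM_LOCK, false))
		return 0;

	for (i = 0; i < nr_vcpus; i++)
		if (!toy_lock_acquire(t, TOY_CLASS_VCPU_MUTEX, use_nest_lock))
			break;
	return i;
}
```

With 64 vCPUs, the model without a nest lock stops after 47 vCPU locks (48 entries including the outer lock), while the nest-lock variant keeps the stack at depth 2 and only bumps the counter.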
2025-05-27Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.16:

 - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN.
 - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 - Add support for ALLOWED_SEV_FEATURES.
 - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits.
 - Don't account temporary allocations in sev_send_update_data().
 - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-27Merge tag 'kvm-x86-vmx-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.16:

 - Explicitly check MSR load/store list counts to fix a potential overflow on 32-bit kernels.
 - Flush shadow VMCSes on emergency reboot.
 - Revert mem_enc_ioctl() back to an optional hook, as it's nullified when SEV or TDX is disabled via Kconfig.
 - Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
2025-05-27Merge tag 'kvm-x86-selftests-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM selftests changes for 6.16:

 - Add support for SNP to the various SEV selftests.
 - Add a selftest to verify fastops instructions via forced emulation.
 - Add MGLRU support to the access tracking perf test.
2025-05-27Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 posted interrupt changes for 6.16: Refine and optimize KVM's software processing of the PIR, and ultimately share PIR harvesting code between KVM and the kernel's Posted MSI handler
2025-05-27Merge tag 'kvm-x86-mmu-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.16:

 - Refine and harden handling of spurious faults.
 - Use kvm_x86_call() instead of open coding static_call().
2025-05-27Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.16:

 - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX.
 - Advertise support to userspace for WRMSRNS and PREFETCHI.
 - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing.
 - Add a module param to control and enumerate support for device posted interrupts.
 - Misc cleanups.
2025-05-27loop: add file_start_write() and file_end_write()Ming Lei
file_start_write() and file_end_write() should be added around ->write_iter(). Recently we switched to ->write_iter() from vfs_iter_write(), and the implied file_start_write() and file_end_write() were lost. Also, we never added them for the DIO code path, so add them back to cover both. Cc: Jeff Moyer <jmoyer@redhat.com> Fixes: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O") Fixes: bc07c10a3603 ("block: loop: support DIO & AIO") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250527153405.837216-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27KVM: VMX: use __always_inline for is_td_vcpu and is_tdEdward Adam Davis
is_td() and is_td_vcpu() are used in no-instrumentation sections; use __always_inline instead of inline.

    vmlinux.o: error: objtool: vmx_handle_nmi+0x47: call to is_td_vcpu.isra.0() leaves .noinstr.text section

Fixes: 7172c753c26a ("KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Message-ID: <tencent_1A767567C83C1137829622362E4A72756F09@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27io_uring/zcrx: init id for xa_findPavel Begunkov
xa_find() interprets id as the lower bound and thus expects it initialised. Reported-by: syzbot+c3ff04150c30d3df0f57@syzkaller.appspotmail.com Fixes: 76f1cc98b23ce ("io_uring/zcrx: add support for multiple ifqs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/faea44ef63131e6968f635e1b6b7ca6056f1f533.1748359655.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
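Why the start index must be initialised can be seen in a small stand-in for a lower-bound search. The toy_find() helper below is hypothetical and works on a plain array, not on an XArray; it only mimics the convention that the index argument is both the lower bound on entry and the found position on return.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the first non-NULL slot at or after *indexp. On success, *indexp is
 * updated to the slot's index. The caller must initialise *indexp, since it
 * is read before it is written.
 */
static void *toy_find(void *const *slots, size_t nslots, unsigned long *indexp)
{
	unsigned long i;

	assert(slots != NULL && indexp != NULL);

	for (i = *indexp; i < nslots; i++) {
		if (slots[i]) {
			*indexp = i;
			return slots[i];
		}
	}
	return NULL;
}
```

Starting from 0 finds any populated slot, whereas a start value that happens to lie beyond the populated slot makes the search miss it, which is the kind of behaviour an uninitialised index could cause.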
2025-05-27Merge tag 'timers-core-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: "Updates for the time/timer core code: - Rework the initialization of the posix-timer kmem_cache and move the cache pointer into the timer_data structure to prevent false sharing - Switch the alarmtimer code to lock guards - Improve the CPU selection criteria in the per CPU validation of the clocksource watchdog to avoid arbitrary selections (or omissions) on systems with a small number of CPUs - The usual cleanups and improvements" * tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/nohz: Remove unused tick_nohz_full_add_cpus_to() clocksource: Fix the CPUs' choice in the watchdog per CPU verification alarmtimer: Switch spin_{lock,unlock}_irqsave() to guards alarmtimer: Remove dead return value in clock2alarm() time/jiffies: Change register_refined_jiffies() to void __init timers: Remove unused __round_jiffies(_up) posix-timers: Initialize cache early and move pointer into __timer_data
2025-05-27Merge tag 'timers-clocksource-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull clocksource updates from Thomas Gleixner:
 "Updates for clocksource/clockevent drivers:

   - The final conversion of text formatted device tree binding to schemas
   - A new driver for the System Timer Module on S32G NXP SoCs
   - A new driver for the EcoNet HPT timer
   - The usual improvements and device tree binding updates"

* tag 'timers-clocksource-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  clocksource/drivers/renesas-ostm: Unconditionally enable reprobe support
  dt-bindings: timer: renesas,ostm: Document RZ/V2N (R9A09G056) support
  dt-bindings: timer: Convert marvell,armada-370-timer to DT schema
  dt-bindings: timer: Convert ti,keystone-timer to DT schema
  dt-bindings: timer: Convert st,spear-timer to DT schema
  dt-bindings: timer: Convert socionext,milbeaut-timer to DT schema
  dt-bindings: timer: Convert snps,arc-timer to DT schema
  dt-bindings: timer: Convert snps,archs-rtc to DT schema
  dt-bindings: timer: Convert snps,archs-gfrc to DT schema
  dt-bindings: timer: Convert lsi,zevio-timer to DT schema
  dt-bindings: timer: Convert jcore,pit to DT schema
  dt-bindings: timer: Convert img,pistachio-gptimer to DT schema
  dt-bindings: timer: Convert ezchip,nps400-timer to DT schema
  dt-bindings: timer: Convert cirrus,clps711x-timer to DT schema
  dt-bindings: timer: Convert altr,timer-1.0 to DT schema
  dt-bindings: timer: Add ESWIN EIC7700 CLINT
  clocksource/drivers: Add EcoNet Timer HPT driver
  dt-bindings: timer: Add EcoNet EN751221 "HPT" CPU Timer
  dt-bindings: timer: Convert arm,mps2-timer to DT schema
  dt-bindings: timer: Add Sophgo SG2044 ACLINT timer
  ...
2025-05-27Merge tag 'timers-cleanups-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "Another set of timer API cleanups: - Convert init_timer*(), try_to_del_timer_sync() and destroy_timer_on_stack() over to the canonical timer_*() namespace convention. There is another large conversion pending, which has not been included because it would have caused a gazillion of merge conflicts in next. The conversion scripts will be run towards the end of the merge window and a pull request sent once all conflict dependencies have been merged" * tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack() treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try() timers: Rename init_timers() as timers_init() timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA timers: Rename __init_timer_on_stack() as __timer_init_on_stack() timers: Rename __init_timer() as __timer_init() timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack() timers: Rename init_timer_key() as timer_init_key()
2025-05-27ksmbd: allow a filename to contain special characters on SMB3.1.1 posix ↵Namjae Jeon
extension If a client sends SMB2_CREATE_POSIX_CONTEXT to ksmbd, allow a filename to contain special characters. Reported-by: Philipp Kerling <pkerling@casix.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-27ksmbd: provide zero as a unique ID to the Mac clientNamjae Jeon
The Mac SMB client code seems to expect the on-disk file identifier to have the semantics of an HFS+ Catalog Node Identifier (CNID). ksmbd provides the inode number as a unique ID to the client, but with btrfs subvolumes there are cases where different files have the same inode number, so the Mac SMB client treats it as an error. There is a report that a similar problem occurs when the share is ZFS. Returning a UniqueId of zero makes the Mac client stop using and trusting the file ID returned from the server. Reported-by: Justin Turner Arthur <justinarthur@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-27Merge tag 'irq-msi-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MSI updates from Thomas Gleixner: "Updates for the MSI subsystem (core code and PCI): - Switch the MSI descriptor locking to lock guards - Replace a broken and naive implementation of PCI/MSI-X control word updates in the PCI/TPH driver with a properly serialized variant in the PCI/MSI core code. - Remove the MSI descriptor abuse in the SCSI/UFS/QCOM driver by replacing the direct access to the MSI descriptors with the proper API function calls. People will never understand that APIs exist for a reason... - Provide core infrastructure for the upcoming PCI endpoint library extensions. Currently limited to ARM GICv3+, but in theory extensible to other architectures. - Provide a MSI domain::teardown() callback, which allows drivers to undo the effects of the prepare() callback. - Move the MSI domain::prepare() callback invocation to domain creation time to avoid redundant (and in case of ARM/GIC-V3-ITS confusing) invocations on every allocation. In combination with the new teardown callback this removes some ugly hacks in the GIC-V3-ITS driver, which pretended to work around the shortcomings of the core code so far. With this update the code is correct by design and implementation. - Make the irqchip MSI library globally available, provide a MSI parent domain creation helper and convert a bunch of (PCI/)MSI drivers over to the modern MSI parent mechanism. This is the first step to get rid of at least one incarnation of the three PCI/MSI management schemes.
- The usual small cleanups and improvements" * tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) PCI/MSI: Use bool for MSI enable state tracking PCI: tegra: Convert to MSI parent infrastructure PCI: xgene: Convert to MSI parent infrastructure PCI: apple: Convert to MSI parent infrastructure irqchip/msi-lib: Honour the MSI_FLAG_NO_AFFINITY flag irqchip/mvebu: Convert to msi_create_parent_irq_domain() helper irqchip/gic: Convert to msi_create_parent_irq_domain() helper genirq/msi: Add helper for creating MSI-parent irq domains irqchip: Make irq-msi-lib.h globally available irqchip/gic-v3-its: Use allocation size from the prepare call genirq/msi: Engage the .msi_teardown() callback on domain removal genirq/msi: Move prepare() call to per-device allocation irqchip/gic-v3-its: Implement .msi_teardown() callback genirq/msi: Add .msi_teardown() callback as the reverse of .msi_prepare() irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask dt-bindings: PCI: pci-ep: Add support for iommu-map and msi-map irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable() platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() genirq/msi: Rename msi_[un]lock_descs() ...
2025-05-27Merge tag 'irq-cleanups-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq cleanups from Thomas Gleixner: "A set of cleanups for the generic interrupt subsystem: - Consolidate on one set of functions for the interrupt domain code to get rid of pointlessly duplicated code with only marginal different semantics. - Update the documentation accordingly and consolidate the coding style of the irqdomain header" * tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) irqdomain: Consolidate coding style irqdomain: Fix kernel-doc and add it to Documentation Documentation: irqdomain: Update it Documentation: irq-domain.rst: Simple improvements Documentation: irq/concepts: Minor improvements Documentation: irq/concepts: Add commas and reflow irqdomain: Improve kernel-docs of functions irqdomain: Make struct irq_domain_info variables const irqdomain: Use irq_domain_instantiate()'s return value as initializers irqdomain: Drop irq_linear_revmap() pinctrl: keembay: Switch to irq_find_mapping() irqchip/armada-370-xp: Switch to irq_find_mapping() gpu: ipu-v3: Switch to irq_find_mapping() gpio: idt3243x: Switch to irq_find_mapping() sh: Switch to irq_find_mapping() powerpc: Switch to irq_find_mapping() irqdomain: Drop irq_domain_add_*() functions powerpc: Switch irq_domain_add_nomap() to use fwnode thermal: Switch to irq_domain_create_linear() soc: Switch to irq_domain_create_*() ...
2025-05-27Merge tag 'irq-drivers-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq controller updates from Thomas Gleixner: "Update for interrupt chip drivers: - Convert the generic interrupt chip to lock guards to remove copy & pasta boilerplate code and gotos. - A new driver for the interrupt controller in the EcoNet EN751221 MIPS SoC. - Extend the SG2042-MSI driver to support the new SG2044 SoC - Updates and cleanups for the (ancient) VT8500 driver - Improve the scalability of the ARM GICV4.1 ITS driver by utilizing node local copies of a VM's interrupt translation table when possible. This results in a 12% reduction of VM IPI latency in certain workloads. - The usual cleanups and improvements all over the place" * tag 'irq-drivers-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) irqchip/irq-pruss-intc: Simplify chained interrupt handler setup irqchip/gic-v4.1: Use local 4_1 ITS to generate VSGI irqchip/econet-en751221: Switch to of_fwnode_handle() irqchip/irq-vt8500: Switch to irq_domain_create_*() irqchip/econet-en751221: Switch to irq_domain_create_linear() irqchip/irq-vt8500: Use fewer global variables and add error handling irqchip/irq-vt8500: Use a dedicated chained handler function irqchip/irq-vt8500: Don't require 8 interrupts from a chained controller irqchip/irq-vt8500: Drop redundant copy of the device node pointer irqchip/irq-vt8500: Split up ack/mask functions irqchip/sg2042-msi: Fix wrong type cast in sg2044_msi_irq_ack() irqchip/sg2042-msi: Add the Sophgo SG2044 MSI interrupt controller irqchip/sg2042-msi: Introduce configurable chipinfo for SG2042 irqchip/sg2042-msi: Rename functions and data structures to be SG2042 agnostic dt-bindings: interrupt-controller: Add Sophgo SG2044 MSI controller genirq/generic-chip: Fix incorrect lock guard conversions genirq/generic-chip: Remove unused lock wrappers irqchip: Convert generic irqchip locking to guards gpio: mvebu: Convert generic irqchip locking to guard() ARM: orion/gpio:: Convert generic irqchip locking to guard() ...
2025-05-27Merge tag 'irq-core-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq core updates from Thomas Gleixner: "Updates for the generic interrupt subsystem core code: - Address a long standing subtle problem in the CPU hotplug code for affinity-managed interrupts. Affinity-managed interrupts are shut down by the core code when the last CPU in the affinity set goes offline and started up again when the first CPU in the affinity set becomes online again. This unfortunately does not take into account whether an interrupt has been disabled before the last CPU goes offline and starts up the interrupt unconditionally when the first CPU becomes online again. That's obviously not what drivers expect. Address this by preserving the disabled state for affinity-managed interrupts across these CPU hotplug operations. All non-managed interrupts are not affected by this because startup/shutdown is coupled to request/free_irq() which obviously has to reset state. - Support three-cell scheme interrupts to allow GPIO drivers to specify interrupts from an already existing scheme - Switch the interrupt subsystem core to lock guards. This gets rid of quite some copy & pasta boilerplate code all over the place.
- The usual small cleanups and improvements all over the place" * tag 'irq-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits) genirq/irqdesc: Remove double locking in hwirq_show() genirq: Retain disable depth for managed interrupts across CPU hotplug genirq: Bump the size of the local variable for sprintf() genirq/manage: Use the correct lock guard in irq_set_irq_wake() genirq: Consistently use '%u' format specifier for unsigned int variables genirq: Ensure flags in lock guard is consistently initialized genirq: Fix inverted condition in handle_nested_irq() genirq/cpuhotplug: Fix up lock guards conversion brainf..t genirq: Use scoped_guard() to shut clang up genirq: Remove unused remove_percpu_irq() genirq: Remove irq_[get|put]_desc*() genirq/manage: Rework irq_set_irqchip_state() genirq/manage: Rework irq_get_irqchip_state() genirq/manage: Rework teardown_percpu_nmi() genirq/manage: Rework prepare_percpu_nmi() genirq/manage: Rework disable_percpu_irq() genirq/manage: Rework irq_percpu_is_enabled() genirq/manage: Rework enable_percpu_irq() genirq/manage: Rework irq_set_parent() genirq/manage: Rework can_request_irq() ...
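The managed-interrupt hotplug fix described above can be pictured with a small state model: a nested disable count that survives the shutdown/startup cycle, so that bringing the first CPU back online only restarts the interrupt if nobody holds it disabled. This is a toy userspace sketch under that reading of the pull message, not the genirq code; the struct, its fields and all `toy_*` functions are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of one affinity-managed interrupt. */
struct toy_irq {
	unsigned int depth;	/* nested disable count; 0 means enabled */
	bool shutdown;		/* true while no CPU of the affinity set is online */
	bool started;		/* true while the interrupt may be delivered */
};

static void toy_irq_init(struct toy_irq *irq)
{
	irq->depth = 0;
	irq->shutdown = false;
	irq->started = true;
}

/* Driver-side disable: always stops delivery and records the request. */
static void toy_disable(struct toy_irq *irq)
{
	irq->depth++;
	irq->started = false;
}

/* Driver-side enable: restarts only when balanced and a CPU is available. */
static void toy_enable(struct toy_irq *irq)
{
	if (irq->depth > 0 && --irq->depth == 0 && !irq->shutdown)
		irq->started = true;
}

/* Last CPU of the affinity set goes offline: core shuts the interrupt down. */
static void toy_last_cpu_offline(struct toy_irq *irq)
{
	irq->shutdown = true;
	irq->started = false;
}

/*
 * First CPU of the affinity set comes back: start up again, but keep the
 * disable depth, so a previously disabled interrupt stays quiet.
 */
static void toy_first_cpu_online(struct toy_irq *irq)
{
	irq->shutdown = false;
	if (irq->depth == 0)
		irq->started = true;
}
```

In this model, the old behavior corresponds to setting `started = true` unconditionally in `toy_first_cpu_online()`, which is exactly the surprise for drivers that the pull message describes.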
2025-05-27Merge tag 'core-entry-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry code updates from Thomas Gleixner: "Updates for the generic and architecture entry code: - Move LoongArch and RISC-V ret_from_fork() implementations to C code so that syscall_exit_user_mode() can be inlined - Split the RISC-V ret_from_fork() implementation into return to user and return to kernel, which gives a measurable performance improvement - Inline syscall_exit_user_mode() which benefits all architectures by avoiding a function call and letting the compiler do better optimizations" * tag 'core-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: LoongArch: entry: Fix include order entry: Inline syscall_exit_to_user_mode() LoongArch: entry: Migrate ret_from_fork() to C riscv: entry: Split ret_from_fork() into user and kernel riscv: entry: Convert ret_from_fork() to C
2025-05-27virtio_rtc: Add RTC class driverPeter Hilber
Expose the virtio-rtc UTC-like clock as an RTC clock to userspace - if it is present, and if it does not step on leap seconds. The RTC class enables the virtio-rtc device to resume the system from sleep states on RTC alarm. Support RTC alarm if the virtio-rtc alarm feature is present. The virtio-rtc device signals an alarm by marking an alarmq buffer as used. Peculiarities ------------- A virtio-rtc clock is a bit special for an RTC clock in that - the clock may step (also backwards) autonomously at any time and - the device, and its notification mechanism, will be reset during boot or resume from sleep. The virtio-rtc device avoids that the driver might miss an alarm. The device signals an alarm whenever the clock has reached or passed the alarm time, and also when the device is reset (on boot or resume from sleep), if the alarm time is in the past. Open Issue ---------- The CLOCK_BOOTTIME_ALARM will use the RTC clock to wake up from sleep, and implicitly assumes that no RTC clock steps will occur during sleep. The RTC class driver does not know whether the current alarm is a real-time alarm or a boot-time alarm. Perhaps this might be handled by the driver also setting a virtio-rtc monotonic alarm (which uses a clock similar to CLOCK_BOOTTIME_ALARM). The virtio-rtc monotonic alarm would just be used to wake up in case it was a CLOCK_BOOTTIME_ALARM alarm. Otherwise, the behavior should not differ from other RTC class drivers. Signed-off-by: Peter Hilber <quic_philber@quicinc.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Message-Id: <20250509160734.1772-5-quic_philber@quicinc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-27virtio_rtc: Add Arm Generic Timer cross-timestampingPeter Hilber
For platforms using the Arm Generic Timer, add precise cross-timestamping support to virtio_rtc. Always report the CP15 virtual counter as the HW counter in use by arm_arch_timer, since the Linux kernel's usage of the Arm Generic Timer should always be compatible with this. Signed-off-by: Peter Hilber <quic_philber@quicinc.com> Message-Id: <20250509160734.1772-4-quic_philber@quicinc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-27virtio_rtc: Add PTP clocksPeter Hilber
Expose the virtio_rtc clocks as PTP clocks to userspace, similar to ptp_kvm. virtio_rtc can expose multiple clocks, e.g. a UTC clock and a monotonic clock. Userspace should distinguish different clocks through the name assigned by the driver. In particular, UTC-like clocks can also be distinguished by if and how leap seconds are smeared. udev rules such as the following can be used to get different symlinks for different clock types: SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 0/variant 0", SYMLINK += "ptp_virtio" SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 1/variant 0", SYMLINK += "ptp_virtio_tai" SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 2/variant 0", SYMLINK += "ptp_virtio_monotonic" SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 3/variant 0", SYMLINK += "ptp_virtio_smear_unspecified" SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 3/variant 1", SYMLINK += "ptp_virtio_smear_noon_linear" SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 3/variant 2", SYMLINK += "ptp_virtio_smear_sls" SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 4/variant 0", SYMLINK += "ptp_virtio_maybe_smeared" The preferred PTP clock reading method is ioctl PTP_SYS_OFFSET_PRECISE2, through the ptp_clock_info.getcrosststamp() op. For now, PTP_SYS_OFFSET_PRECISE2 will return -EOPNOTSUPP through a weak function. PTP_SYS_OFFSET_PRECISE2 requires cross-timestamping support for specific clocksources, which will be added in the following. If the clocksource specific code is enabled, check that the Virtio RTC device supports the respective HW counter before obtaining an actual cross-timestamp from the Virtio device. The Virtio RTC device response time may be higher than the timekeeper seqcount increment interval. Therefore, obtain the cross-timestamp before calling get_device_system_crosststamp(). As a fallback, support the ioctl PTP_SYS_OFFSET_EXTENDED2 for all platforms. 
Assume that concurrency issues during PTP clock removal are avoided by the posix_clock framework. Kconfig recursive dependencies prevent virtio_rtc from implicitly enabling PTP_1588_CLOCK, therefore just warn the user if PTP_1588_CLOCK is not available. Since virtio_rtc should in the future also expose clocks as RTC class devices, do not depend VIRTIO_RTC on PTP_1588_CLOCK. Signed-off-by: Peter Hilber <quic_philber@quicinc.com> Message-Id: <20250509160734.1772-3-quic_philber@quicinc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-27virtio_rtc: Add module and driver corePeter Hilber
Add the virtio_rtc module and driver core. The virtio_rtc module implements a driver compatible with the proposed Virtio RTC device specification. The Virtio RTC (Real Time Clock) device provides information about current time. The device can provide different clocks, e.g. for the UTC or TAI time standards, or for physical time elapsed since some past epoch. The driver can read the clocks with simple or more accurate methods. Implement the core, which interacts with the Virtio RTC device. Apart from this, the core does not expose functionality outside of the virtio_rtc module. Follow-up patches will expose PTP clocks and an RTC Class device. Provide synchronous messaging, which is enough for the expected time synchronization use cases through PTP clocks (similar to ptp_kvm) or RTC Class device. Signed-off-by: Peter Hilber <quic_philber@quicinc.com> Message-Id: <20250509160734.1772-2-quic_philber@quicinc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-27vringh: use bvec_kmap_localChristoph Hellwig
Use the bvec_kmap_local helper rather than digging into the bvec internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20250501142244.2888227-1-hch@lst.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2025-05-27vhost: vringh: Use matching allocation type in resize_iovec()Kees Cook
In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "struct kvec *", but the returned type will be "struct iovec *". These have the same allocation size, so there is no bug: struct kvec { void *iov_base; /* and that should *never* hold a userland pointer */ size_t iov_len; }; struct iovec { void __user *iov_base; /* BSD uses caddr_t (1003.1g requires void *) */ __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; Adjust the allocation type to match the assignment. Signed-off-by: Kees Cook <kees@kernel.org> Message-Id: <20250426062214.work.334-kees@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
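The claim that both vector types have the same allocation size can be checked at compile time. The sketch below does this in userspace, assuming a POSIX system where `<sys/uio.h>` provides `struct iovec`; `struct kvec_like` is a hypothetical copy of the kernel's `struct kvec` layout quoted above, and the helper functions are illustrative only.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical userspace mirror of the kernel's struct kvec layout. */
struct kvec_like {
	void *iov_base;
	size_t iov_len;
};

/* Fail the build if the two layouts ever diverge. */
_Static_assert(sizeof(struct kvec_like) == sizeof(struct iovec),
	       "kvec-like and iovec must have the same size");
_Static_assert(offsetof(struct kvec_like, iov_len) ==
	       offsetof(struct iovec, iov_len),
	       "iov_len must sit at the same offset in both types");

/* Bytes needed for an array of n entries, computed from each type. */
static size_t vec_bytes_kvec(size_t n)
{
	return n * sizeof(struct kvec_like);
}

static size_t vec_bytes_iovec(size_t n)
{
	return n * sizeof(struct iovec);
}
```

Equal sizes explain why the mismatch was harmless at runtime; the patch only makes the allocation's type agree with the variable it is assigned to.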
2025-05-27virtio-pci: Fix result size returned for the admin command completionIsrael Rukshin
The result size returned by virtio_pci_admin_dev_parts_get() is 8 bytes larger than the actual result data size. This occurs because the result_sg_size field of the command is filled with the result length from virtqueue_get_buf(), which includes both the data size and an additional 8 bytes of status. This oversized result size causes two issues: 1. The state transferred to the destination includes 8 bytes of extra data at the end. 2. The allocated buffer in the kernel may be smaller than the returned size, leading to failures when reading beyond the allocated size. The commit fixes this by subtracting the status size from the result of virtqueue_get_buf(). This fix has been tested through live migrations with virtio-net, virtio-net-transitional, and virtio-blk devices. Fixes: 704806ca400e ("virtio: Extend the admin command to include the result size") Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1745318025-23103-1-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
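The arithmetic behind this fix is small: the length reported for the used buffer covers the result data plus 8 bytes of status, so the data size is that length minus 8. A minimal sketch follows; the macro and function names are hypothetical, and clamping short lengths to zero is an assumption of this example rather than something the commit message states.

```c
#include <stddef.h>

/* Size of the status trailer, per the commit message (8 bytes). */
#define TOY_ADMIN_STATUS_SIZE 8u

/*
 * Convert the length reported for the used buffer into the size of the
 * result data alone. Lengths shorter than the status are clamped to 0
 * (assumption made for this sketch).
 */
static size_t toy_admin_result_data_size(size_t used_len)
{
	if (used_len < TOY_ADMIN_STATUS_SIZE)
		return 0;
	return used_len - TOY_ADMIN_STATUS_SIZE;
}
```

For example, a reported length of 72 bytes corresponds to 64 bytes of device state, which is the size a caller would need to allocate and transfer.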
2025-05-27vdpa/octeon_ep: Control PCI dev enabling manuallyPhilipp Stanner
PCI region request functions such as pci_request_region() currently have the problem of sometimes becoming managed functions, if pcim_enable_device() instead of pci_enable_device() was called. The PCI subsystem wants to remove this deprecated behavior from its interfaces. octeon_ep enables its device with pcim_enable_device() (for the VF; the PF uses manual management), but does so only to get automatic disablement. The driver wants to manage its PCI resources for the VF manually, without devres. The easiest way not to use automatic resource management at all is by also handling device enablement and disablement manually. Replace pcim_enable_device() with pci_enable_device(). Add the necessary calls to pci_disable_device(). Signed-off-by: Philipp Stanner <phasta@kernel.org> Acked-by: Vamsi Attunuru <vattunuru@marvell.com> Message-Id: <20250508085134.24084-2-phasta@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2025-05-27Revert "drm/i915/gem: Allow EXEC_CAPTURE on recoverable contexts on DG1"Joonas Lahtinen
This reverts commit d6e020819612a4a06207af858e0978be4d3e3140. The IS_DGFX check was put in place because error capture of buffer objects is expected to be broken on devices with VRAM. Userspace fix[1] to the impacted media driver has been submitted, merged and a new driver release is out as 25.2.3 where the capture flag is dropped on DG1 thus unblocking the usage of media driver on DG1. [1] https://github.com/intel/media-driver/commit/93c07d9b4b96a78bab21f6acd4eb863f4313ea4a Cc: stable@vger.kernel.org # v6.0+ Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20250522064127.24293-1-joonas.lahtinen@linux.intel.com [Joonas: Update message to point out the merged userspace fix] Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (cherry picked from commit d2dc30e0aa252830f908c8e793d3139d51321370) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-05-27drm/i915/gem: Allow EXEC_CAPTURE on recoverable contexts on DG1Ville Syrjälä
The intel-media-driver is currently broken on DG1 because it uses EXEC_CAPTURE with recoverable contexts. Relax the check to allow that. I've also submitted a fix for the intel-media-driver: https://github.com/intel/media-driver/pull/1920 Cc: stable@vger.kernel.org # v6.0+ Cc: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Testcase: igt/gem_exec_capture/capture-invisible Fixes: 71b1669ea9bd ("drm/i915/uapi: tweak error capture on recoverable contexts") Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@kernel.org> Link: https://lore.kernel.org/r/20250411144313.11660-2-ville.syrjala@linux.intel.com (cherry picked from commit d6e020819612a4a06207af858e0978be4d3e3140) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-05-27bcache: reserve more RESERVE_BTREE buckets to prevent allocator hangMingzhe Zou
An IO hang and an unrecoverable error were reported in our testing environment. After careful research, we found that bch_allocator_thread is stuck; the call stack is as follows: [<0>] __switch_to+0xbc/0x108 [<0>] __closure_sync+0x7c/0xbc [bcache] [<0>] bch_prio_write+0x430/0x448 [bcache] [<0>] bch_allocator_thread+0xb44/0xb70 [bcache] [<0>] kthread+0x124/0x130 [<0>] ret_from_fork+0x10/0x18 Moreover, the RESERVE_BTREE type bucket slots are empty and journal_full occurs at the same time. When the cache disk is first used, sb.nJournal_buckets defaults to 0, so only 8 RESERVE_BTREE type buckets are reserved. If the RESERVE_BTREE type buckets are used up, or btree_check_reserve() fails while a request handles a btree split, the request is retried repeatedly and waits for the alloc thread to refill the buckets. After the alloc thread fills the buckets, it calls bch_prio_write(). If journal_full occurs at the same time, journal_reclaim() and btree_flush_write() are called sequentially and journal_write cannot complete. This is a low-probability event; we believe that reserving more RESERVE_BTREE buckets avoids the worst case. Fixes: 682811b3ce1a ("bcache: fix for allocator and register thread race") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-4-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>