summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2025-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc4). This pull includes wireless and a fix to vxlan which isn't in Linus's tree just yet. The latter creates with a silent conflict / build breakage, so merging it now to avoid causing problems. drivers/net/vxlan/vxlan_vnifilter.c 094adad91310 ("vxlan: Use a single lock to protect the FDB table") 087a9eb9e597 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry") https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com No "normal" conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24KVM: SVM: Fix SNP AP destroy race with VMRUNTom Lendacky
An AP destroy request for a target vCPU is typically followed by an RMPADJUST to remove the VMSA attribute from the page currently being used as the VMSA for the target vCPU. This can result in a vCPU that is about to VMRUN to exit with #VMEXIT_INVALID. This usually does not happen as APs are typically sitting in HLT when being destroyed and therefore the vCPU thread is not running at the time. However, if HLT is allowed inside the VM, then the vCPU could be about to VMRUN when the VMSA attribute is removed from the VMSA page, resulting in a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the guest to crash. An RMPADJUST against an in-use (already running) VMSA results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA attribute cannot be changed until the VMRUN for target vCPU exits. The Qemu command line option '-overcommit cpu-pm=on' is an example of allowing HLT inside the guest. Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for requests to be honored, so create kvm_make_request_and_kick() that will add a new event request and honor the KVM_REQUEST_WAIT flag. This will ensure that the target vCPU sees the AP destroy request before returning to the initiating vCPU should the target vCPU be in guest mode. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/fe2c885bf35643dd224e91294edb6777d5df23a4.1743097196.git.thomas.lendacky@amd.com [sean: add a comment explaining the use of smp_send_reschedule()] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: KVM: Add helper for harvesting PIR to deduplicate KVM and posted MSIsSean Christopherson
Now that posted MSI and KVM harvesting of PIR is identical, extract the code (and posted MSI's wonderful comment) to a common helper. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Use arch_xchg() when processing PIR to avoid instrumentationSean Christopherson
Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid instrumentation so that KVM is compatible with the needs of posted MSI. This will allow extracting the core PIR logic to common code and sharing it between KVM and posted MSI handling. Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Isolate pure loads from atomic XCHG when processing PIRSean Christopherson
Rework KVM's processing of the PIR to use the same algorithm as posted MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4). Given KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's safe to say far more thought and investigation was put into handling the PIR for posted MSIs, i.e. there's no reason to assume KVM's existing logic is meaningful, let alone superior. Matching the processing done by posted MSIs will also allow deduplicating the code between KVM and posted MSIs. See the comment for handle_pending_pir() added by commit 1b03d82ba15e ("x86/irq: Install posted MSI notification handler") for details on why isolating loads from XCHG is desirable. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Process PIR using 64-bit accesses on 64-bit kernelsSean Christopherson
Process the PIR at the natural kernel width, i.e. in 64-bit chunks on 64-bit kernels, so that the worst case of having a posted IRQ in each chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8. Deliberately use a "continue" to skip empty entries so that the code is a carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM and posted MSI logic. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: KVM: Track PIR bitmap as an "unsigned long" arraySean Christopherson
Track the PIR bitmap in posted interrupt descriptor structures as an array of unsigned longs instead of using unionized arrays for KVM (u32s) versus IRQ management (u64s). In practice, because the non-KVM usage is (sanely) restricted to 64-bit kernels, all existing usage of the u64 variant is already working with unsigned longs. Using "unsigned long" for the array will allow reworking KVM's processing of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will allow optimizing KVM by reducing the number of atomic accesses to PIR. Opportunstically replace the open coded literals in the posted MSIs code with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the for-loops, even though it would be cleaner from a certain perspective, in anticipation of decoupling the processing from the array declaration. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Ensure vIRR isn't reloaded at odd times when sync'ing PIRSean Christopherson
Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC to ensure getting the highest priority IRQ from the chunk doesn't reload from the vIRR. In practice, a reload is functionally benign as vcpu->mutex is held and so IRQs can be consumed, i.e. new IRQs can appear, but existing IRQs can't disappear. Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: Track if IRQ was found in PIR during initial loop (to load PIR vals)Sean Christopherson
Track whether or not at least one IRQ was found in PIR during the initial loop to load PIR chunks from memory. Doing so generates slightly better code (arguably) for processing the for-loop of XCHGs, especially for the case where there are no pending IRQs. Note, while PIR can be modified between the initial load and the XCHG, it can only _gain_ new IRQs, i.e. there is no danger of a false positive due to the final version of pir_copy[] being empty. Opportunistically convert the boolean to an "unsigned long" and compute the effective boolean result via bitwise-OR. Some compilers, e.g. clang-14, need the extra "hint" to elide conditional branches. Opportunistically rename the variable in anticipation of moving the PIR accesses to a common helper that can be shared by posted MSIs and KVM. Old: <+74>: test %rdx,%rdx <+77>: je 0xffffffff812bbeb0 <handle_pending_pir+144> <pir[0]> <+88>: mov $0x1,%dl> <+90>: test %rsi,%rsi <+93>: je 0xffffffff812bbe8c <handle_pending_pir+108> <pir[1]> <+106>: mov $0x1,%dl <+108>: test %rcx,%rcx <+111>: je 0xffffffff812bbe9e <handle_pending_pir+126> <pir[2]> <+124>: mov $0x1,%dl <+126>: test %rax,%rax <+129>: je 0xffffffff812bbeb9 <handle_pending_pir+153> <pir[3]> <+142>: jmp 0xffffffff812bbec1 <handle_pending_pir+161> <+144>: xor %edx,%edx <+146>: test %rsi,%rsi <+149>: jne 0xffffffff812bbe7f <handle_pending_pir+95> <+151>: jmp 0xffffffff812bbe8c <handle_pending_pir+108> <+153>: test %dl,%dl <+155>: je 0xffffffff812bbf8e <handle_pending_pir+366> New: <+74>: mov %rax,%r8 <+77>: or %rcx,%r8 <+80>: or %rdx,%r8 <+83>: or %rsi,%r8 <+86>: setne %bl <+89>: je 0xffffffff812bbf88 <handle_pending_pir+360> <+95>: test %rsi,%rsi <+98>: je 0xffffffff812bbe8d <handle_pending_pir+109> <pir[0]> <+109>: test %rdx,%rdx <+112>: je 0xffffffff812bbe9d <handle_pending_pir+125> <pir[1]> <+125>: test %rcx,%rcx <+128>: je 0xffffffff812bbead <handle_pending_pir+141> <pir[2]> <+141>: test %rax,%rax <+144>: je 0xffffffff812bbebd <handle_pending_pir+157> <pir[3]> Link: https://lore.kernel.org/r/20250401163447.846608-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: Ensure initial PIR loads are performed exactly onceSean Christopherson
Ensure the PIR is read exactly once at the start of handle_pending_pir(), to guarantee that checking for an outstanding posted interrupt in a given chuck doesn't reload the chunk from the "real" PIR. Functionally, a reload is benign, but it would defeat the purpose of pre-loading into a copy. Fixes: 1b03d82ba15e ("x86/irq: Install posted MSI notification handler") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250401163447.846608-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/insn: Fix CTEST instruction decodingKirill A. Shutemov
insn_decoder_test found a problem with decoding APX CTEST instructions: Found an x86 instruction decoder bug, please report this. ffffffff810021df 62 54 94 05 85 ff ctestneq objdump says 6 bytes, but insn_get_length() says 5 It happens because x86-opcode-map.txt doesn't specify arguments for the instruction and the decoder doesn't expect to see ModRM byte. Fixes: 690ca3a3067f ("x86/insn: Add support for APX EVEX instructions to the opcode map") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org # v6.10+ Link: https://lore.kernel.org/r/20250423065815.2003231-1-kirill.shutemov@linux.intel.com
2025-04-24KVM: x86: Add module param to control and enumerate device posted IRQsSean Christopherson
Add a module param to each KVM vendor module to allow disabling device posted interrupts without having to sacrifice all of APICv/AVIC, and to also effectively enumerate to userspace whether or not KVM may be utilizing device posted IRQs. Disabling device posted interrupts is very desirable for testing, and can even be desirable for production environments, e.g. if the host kernel wants to interpose on device interrupts. Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match the overall APICv/AVIC controls, and to avoid complications with said controls. E.g. if the param is in kvm.ko, KVM needs to be snapshot the original user-defined value to play nice with a vendor module being reloaded with different enable_apicv settings. Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Don't send UNBLOCK when starting device assignment without APICvSean Christopherson
When starting device assignment, i.e. potential IRQ bypass, don't blast KVM_REQ_UNBLOCK if APICv is disabled/unsupported. There is no need to wake vCPUs if they can never use VT-d posted IRQs (sending UNBLOCK guards against races being vCPUs blocking and devices starting IRQ bypass). Opportunistically use kvm_arch_has_irq_bypass() for all relevant checks in the VMX Posted Interrupt code so that all checks in KVM x86 incorporate the same information (once AMD/AVIC is given similar treatment). Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250401161804.842968-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Rescan I/O APIC routes after EOI interception for old routingweizijie
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC EOI for an IRQ that is not targeting said vCPU, i.e. after handling what's effectively a stale EOI VM-Exit. If a level-triggered IRQ is in-flight when IRQ routing changes, e.g. because the guest changes routing from its IRQ handler, then KVM intercepts EOIs on both the new and old target vCPUs, so that the in-flight IRQ can be de-asserted when it's EOI'd. However, only the EOI for the in-flight IRQ needs to be intercepted, as IRQs on the same vector with the new routing are coincidental, i.e. occur only if the guest is reusing the vector for multiple interrupt sources. If the I/O APIC routes aren't rescanned, KVM will unnecessarily intercept EOIs for the vector and negative impact the vCPU's interrupt performance. Note, both commit db2bdcbbbd32 ("KVM: x86: fix edge EOI and IOAPIC reconfig race") and commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") mentioned this issue, but it was considered a "rare" occurrence thus was not addressed. However in real environments, this issue can happen even in a well-behaved guest. Cc: Kai Huang <kai.huang@intel.com> Co-developed-by: xuyun <xuyun_xy.xy@linux.alibaba.com> Signed-off-by: xuyun <xuyun_xy.xy@linux.alibaba.com> Signed-off-by: weizijie <zijie.wei@linux.alibaba.com> [sean: massage changelog and comments, use int/-1, reset at scan] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250304013335.4155703-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Add a helper to deduplicate I/O APIC EOI interception logicSean Christopherson
Extract the vCPU specific EOI interception logic for I/O APIC emulation into a common helper for userspace and in-kernel emulation in anticipation of optimizing the "pending EOI" case. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250304013335.4155703-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Isolate edge vs. level check in userspace I/O APIC route scanningSean Christopherson
Extract and isolate the trigger mode check in kvm_scan_ioapic_routes() in anticipation of moving destination matching logic to a common helper (for userspace vs. in-kernel I/O APIC emulation). No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250304013335.4155703-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Advertise support for AMD's PREFETCHIBabu Moger
The latest AMD platform has introduced a new instruction called PREFETCHI. This instruction loads a cache line from a specified memory address into the indicated data or instruction cache level, based on locality reference hints. Feature bit definition: CPUID_Fn80000021_EAX [bit 20] - Indicates support for IC prefetch. This feature is analogous to Intel's PREFETCHITI (CPUID.(EAX=7,ECX=1):EDX), though the CPUID bit definitions differ between AMD and Intel. Advertise support to userspace, as no additional enabling is necessary (PREFETCHI can't be intercepted as there's no instruction specific behavior that needs to be virtualize). The feature is documented in Processor Programming Reference (PPR) for AMD Family 1Ah Model 02h, Revision C1 (Link below). Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Babu Moger <babu.moger@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/ee1c08fc400bb574a2b8f2c6a0bd9def10a29d35.1744130533.git.babu.moger@amd.com [sean: rewrite shortlog to highlight the KVM functionality] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Sort CPUID_8000_0021_EAX leaf bits properlyBorislav Petkov
WRMSR_XX_BASE_NS is bit 1 so put it there, add some new bits as comments only. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250324160617.15379-1-bp@kernel.org [sean: skip the FSRS/FSRC placeholders to avoid confusion] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: clean up a returnDan Carpenter
Returning a literal X86EMUL_CONTINUE is slightly clearer than returning rc. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/7604cbbf-15e6-45a8-afec-cf5be46c2924@stanley.mountain Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Advertise support for WRMSRNSSean Christopherson
Advertise support for WRMSRNS (WRMSR non-serializing) to userspace if the instruction is supported by the underlying CPU. From a virtualization perspective, the only difference between WRMSRNS and WRMSR is that VM-Exits due to WRMSRNS set EXIT_QUALIFICATION to '1'. WRMSRNS doesn't require a new enabling control, shares the same basic exit reason, and behaves the same as WRMSR with respect to MSR interception. WRMSR and WRMSRNS use the same basic exit reason (see Appendix C). For WRMSR, the exit qualification is 0, while for WRMSRNS it is 1. Don't do anything different when emulating WRMSRNS vs. WRMSR, as KVM can't do anything less, i.e. can't make emulation non-serializing. The motivation for the guest to use WRMSRNS instead of WRMSR is to avoid immediately serializing the CPU when the necessary serialization is guaranteed by some other mechanism, i.e. WRMSRNS being fully serializing isn't guest-visible, just less performant. Suggested-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250227010111.3222742-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/msr: Rename the WRMSRNS opcode macro to ASM_WRMSRNS (for KVM)Sean Christopherson
Rename the WRMSRNS instruction opcode macro so that it doesn't collide with X86_FEATURE_WRMSRNS when using token pasting to generate references to X86_FEATURE_WRMSRNS. KVM heavily uses token pasting to generate KVM's set of support feature bits, and adding WRMSRNS support in KVM will run will run afoul of the opcode macro. arch/x86/kvm/cpuid.c:719:37: error: pasting "X86_FEATURE_" and "" "" does not give a valid preprocessing token 719 | u32 __leaf = __feature_leaf(X86_FEATURE_##name); \ | ^~~~~~~~~~~~ KVM has worked around one such collision in the past by #undef'ing the problematic macro in order to avoid blocking a KVM rework, but such games are generally undesirable, e.g. requires bleeding macro details into KVM, risks weird behavior if what KVM is #undef'ing changes, etc. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250227010111.3222742-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Generalize IBRS virtualization on emulated VM-exitYosry Ahmed
Commit 2e7eab81425a ("KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS") added an IBPB in the emulated VM-exit path on Intel to properly virtualize IBRS by providing separate predictor modes for L1 and L2. AMD requires similar handling, except when IbrsSameMode is enumerated by the host CPU (which is the case on most/all AMD CPUs). With IbrsSameMode, hardware IBRS is sufficient and no extra handling is needed from KVM. Generalize the handling in nested_vmx_vmexit() by moving it into a generic function, add the AMD handling, and use it in nested_svm_vmexit() too. The main reason for using a generic function is to have a single place to park the huge comment about virtualizing IBRS. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250221163352.3818347-4-yosry.ahmed@linux.dev [sean: use kvm_nested_vmexit_handle_spec_ctrl() for the helper] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Propagate AMD's IbrsSameMode to the guestYosry Ahmed
If IBRS provides same mode (kernel/user or host/guest) protection on the host, then by definition it also provides same mode protection in the guest. In fact, all different modes from the guest's perspective are the same mode from the host's perspective anyway. Propagate IbrsSameMode to the guests. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250221163352.3818347-3-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/cpufeatures: Define X86_FEATURE_AMD_IBRS_SAME_MODEYosry Ahmed
Per the APM [1]: Some processors, identified by CPUID Fn8000_0008_EBX[IbrsSameMode] (bit 19) = 1, provide additional speculation limits. For these processors, when IBRS is set, indirect branch predictions are not influenced by any prior indirect branches, regardless of mode (CPL and guest/host) and regardless of whether the prior indirect branches occurred before or after the setting of IBRS. This is referred to as Same Mode IBRS. Define this feature bit, which will be used by KVM to determine if an IBPB is required on nested VM-exits in SVM. [1] AMD64 Architecture Programmer's Manual Pub. 40332, Rev 4.08 - April 2024, Volume 2, 3.2.9 Speculation Control MSRs Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250221163352.3818347-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run()Dan Carpenter
The "kvm_run->kvm_valid_regs" and "kvm_run->kvm_dirty_regs" variables are u64 type. We are only using the lowest 3 bits but we want to ensure that the users are not passing invalid bits so that we can use the remaining bits in the future. However "sync_valid_fields" and kvm_sync_valid_fields() are u32 type so the check only ensures that the lower 32 bits are clear. Fix this by changing the types to u64. Fixes: 74c1807f6c4f ("KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/ec25aad1-113e-4c6e-8941-43d432251398@stanley.mountain Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: SVM: Forcibly leave SMM mode on SHUTDOWN interceptionMikhail Lobanov
Previously, commit ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset") addressed an issue where a triple fault occurring in nested mode could lead to use-after-free scenarios. However, the commit did not handle the analogous situation for System Management Mode (SMM). This omission results in triggering a WARN when KVM forces a vCPU INIT after SHUTDOWN interception while the vCPU is in SMM. This situation was reprodused using Syzkaller by: 1) Creating a KVM VM and vCPU 2) Sending a KVM_SMI ioctl to explicitly enter SMM 3) Executing invalid instructions causing consecutive exceptions and eventually a triple fault The issue manifests as follows: WARNING: CPU: 0 PID: 25506 at arch/x86/kvm/x86.c:12112 kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112 Modules linked in: CPU: 0 PID: 25506 Comm: syz-executor.0 Not tainted 6.1.130-syzkaller-00157-g164fe5dde9b6 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112 Call Trace: <TASK> shutdown_interception+0x66/0xb0 arch/x86/kvm/svm/svm.c:2136 svm_invoke_exit_handler+0x110/0x530 arch/x86/kvm/svm/svm.c:3395 svm_handle_exit+0x424/0x920 arch/x86/kvm/svm/svm.c:3457 vcpu_enter_guest arch/x86/kvm/x86.c:10959 [inline] vcpu_run+0x2c43/0x5a90 arch/x86/kvm/x86.c:11062 kvm_arch_vcpu_ioctl_run+0x50f/0x1cf0 arch/x86/kvm/x86.c:11283 kvm_vcpu_ioctl+0x570/0xf00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4122 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Architecturally, INIT is blocked when the CPU is in SMM, hence KVM's WARN() in kvm_vcpu_reset() to guard against KVM bugs, e.g. to detect improper emulation of INIT. SHUTDOWN on SVM is a weird edge case where KVM needs to do _something_ sane with the VMCB, since it's technically undefined, and INIT is the least awful choice given KVM's ABI. So, double down on stuffing INIT on SHUTDOWN, and force the vCPU out of SMM to avoid any weirdness (and the WARN). Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru> Link: https://lore.kernel.org/r/20250414171207.155121-1-m.lobanov@rosa.ru [sean: massage changelog, make it clear this isn't architectural behavior] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24perf/x86: Fix non-sampling (counting) events on certain x86 platformsLuo Gengkun
Perf doesn't work at perf stat for hardware events on certain x86 platforms: $perf stat -- sleep 1 Performance counter stats for 'sleep 1': 16.44 msec task-clock # 0.016 CPUs utilized 2 context-switches # 121.691 /sec 0 cpu-migrations # 0.000 /sec 54 page-faults # 3.286 K/sec <not supported> cycles <not supported> instructions <not supported> branches <not supported> branch-misses The reason is that the check in x86_pmu_hw_config() for sampling events is unexpectedly applied to counting events as well. It should only impact x86 platforms with limit_period used for non-PEBS events. For Intel platforms, it should only impact some older platforms, e.g., HSW, BDW and NHM. Fixes: 88ec7eedbbd2 ("perf/x86: Fix low freqency setting issue") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250423064724.3716211-1-luogengkun@huaweicloud.com
2025-04-24Merge branch 'kvm-fixes-6.15-rc4' into HEADPaolo Bonzini
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for Cavium ThunderX * Bugfixes from a planned posted interrupt rework * Do not use kvm_rip_read() unconditionally to cater for guests with inaccessible register state.
2025-04-24Merge tag 'kvmarm-fixes-6.15-2' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.15, round #2 - Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for everyone's favorite pile of sand: Cavium ThunderX
2025-04-24x86/boot: Work around broken busybox 'truncate' toolArd Biesheuvel
The GNU coreutils version of truncate, which is the original, accepts a % prefix for the -s size argument which means the file in question should be padded to a multiple of the given size. This is currently used to pad the setup block of bzImage to a multiple of 4k before appending the decompressor. busybox reimplements truncate but does not support this idiom, and therefore fails the build since commit 9c54baab4401 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it") Since very little build code within the kernel depends on the 'truncate' utility, work around this incompatibility by avoiding truncate altogether, and relying on dd to perform the padding. Fixes: 9c54baab4401 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it") Reported-by: <phasta@kernel.org> Tested-by: Philipp Stanner <phasta@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250424101917.1552527-2-ardb+git@google.com
2025-04-24x86/sev: Share the sev_secrets_pa value againTom Lendacky
This commits breaks SNP guests: 234cf67fc3bd ("x86/sev: Split off startup code from core code") The SNP guest boots, but no longer has access to the VMPCK keys needed to communicate with the ASP, which is used, for example, to obtain an attestation report. The secrets_pa value is defined as static in both startup.c and core.c. It is set by a function in startup.c and so when used in core.c its value will be 0. Share it again and add the sev_ prefix to put it into the global SEV symbols namespace. [ mingo: Renamed to sev_secrets_pa ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Cc: Dionna Amalie Glaze <dionnaglaze@google.com> Cc: Kevin Loughlin <kevinloughlin@google.com> Link: https://lore.kernel.org/r/cf878810-81ed-3017-52c6-ce6aa41b5f01@amd.com
2025-04-24KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILINGAdrian Hunter
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepointsAdrian Hunter
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: WARN if an invalid posted interrupt IRTE entry is addedSean Christopherson
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM attempts to track an AMD IRTE entry without metadata. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producerSean Christopherson
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure irqfd->producer isn't modified while kvm_irq_routing_update() is running. The only lock held when a producer is added/removed is irqbypass's mutex. Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Explicitly treat routing entry type changes as changesSean Christopherson
Explicitly treat type differences as GSI routing changes, as comparing MSI data between two entries could get a false negative, e.g. if userspace changed the type but left the type-specific data as-is. Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Reset IRTE to host control if *new* route isn't postableSean Christopherson
Restore an IRTE back to host control (remapped or posted MSI mode) if the *new* GSI route prevents posting the IRQ directly to a vCPU, regardless of the GSI routing type. Updating the IRTE if and only if the new GSI is an MSI results in KVM leaving an IRTE posting to a vCPU. The dangling IRTE can result in interrupts being incorrectly delivered to the guest, and in the worst case scenario can result in use-after-free, e.g. if the VM is torn down, but the underlying host IRQ isn't freed. Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: Allocate IR data using atomic allocationSean Christopherson
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as svm_ir_list_add() is called with IRQs are disabled and irqfs.lock held when kvm_irq_routing_update() reacts to GSI routing changes. Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: Don't update IRTEs if APICv/AVIC is disabledSean Christopherson
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE into remapped mode (kvm_vcpu_apicv_active() will never be true) is unnecessary and wasteful. The IOMMU driver is responsible for putting IRTEs into remapped mode when an IRQ is allocated by a device, long before that device is assigned to a VM. I.e. the kernel as a whole has major issues if the IRTE isn't already in remapped mode. Opportunsitically kvm_arch_has_irq_bypass() to query for APICv/AVIC, so so that all checks in KVM x86 incorporate the same information. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250401161804.842968-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: arm64, x86: make kvm_arch_has_irq_bypass() inlinePaolo Bonzini
kvm_arch_has_irq_bypass() is a small function and even though it does not appear in any *really* hot paths, it's also not entirely rare. Make it inline---it also works out nicely in preparation for using it in kvm-intel.ko and kvm-amd.ko, since the function is not currently exported. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24arm64: dts: renesas: r8a779h0: Add ISP core function blockNiklas Söderlund
The first ISP instance on V4M has both a channel select and core function block, describe the core region in addition to the existing cs region. While at it update the second ISP to match the new bindings and add the reg-names and interrupt-names property. Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/20250423163113.2961049-5-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2025-04-24arm64: dts: renesas: r8a779g0: Add ISP core function blockNiklas Söderlund
All ISP instances on V4H have both a channel select and core function block, describe the core region in addition to the existing cs region. Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/20250423163113.2961049-4-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2025-04-24arm64: dts: renesas: r8a779a0: Add ISP core function blockNiklas Söderlund
All ISP instances on V3U have both a channel select and core function block, describe the core region in addition to the existing cs region. The interrupt number already described intended to reflect the cs function but did incorrectly describe the core block. This was not noticed until now as the driver do not make use of the interrupt for the cs block. Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/20250423163113.2961049-3-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2025-04-24arm64: dts: renesas: r8a779g3: Add Retronix R-Car V4H Sparrow Hawk board supportMarek Vasut
Add Retronix R-Car V4H Sparrow Hawk board based on Renesas R-Car V4H ES3.0 (R8A779G3) SoC. This is a single-board computer with single gigabit ethernet, DSI-to-eDP bridge, DSI and two CSI2 interfaces, audio codec, two CANFD ports, micro SD card slot, USB PD supply, USB 3.0 ports, M.2 Key-M slot for NVMe SSD, debug UART and JTAG. Tested-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/20250420173829.200553-4-marek.vasut+renesas@mailbox.org Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2025-04-24arm64: dts: mediatek: Add MT8186 Ponyta ChromebooksJianeng Ceng
MT8186 ponyta, known as huaqin custom label, is a MT8186 based laptop. It is based on the "corsola" design. It includes LTE, touchpad combinations. Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Jianeng Ceng <cengjianeng@huaqin.corp-partner.google.com> Link: https://lore.kernel.org/r/20250424010850.994288-3-cengjianeng@huaqin.corp-partner.google.com Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
2025-04-24arm64: dts: mediatek: mt8186-corsola: make SDIO card removableAxe Yang
Under specific conditions, the SDIO function driver needs to remove/add SDIO card to perform a reset. Remove the non-removable property to support this scenario. Signed-off-by: Axe Yang <axe.yang@mediatek.com> Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com> Link: https://lore.kernel.org/r/20250424013603.32351-1-axe.yang@mediatek.com Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
2025-04-24powerpc/boot: Fix dash warningMadhavan Srinivasan
'commit b2accfe7ca5b ("powerpc/boot: Check for ld-option support")' suppressed linker warnings, but the expressed used did not go well with POSIX shell (dash) resulting with this warning arch/powerpc/boot/wrapper: 237: [: 0: unexpected operator ld: warning: arch/powerpc/boot/zImage.epapr has a LOAD segment with RWX permissions Fix the check to handle the reported warning. Patch also fixes couple of shellcheck reported errors for the same line. In arch/powerpc/boot/wrapper line 237: if [ $(${CROSS}ld -v --no-warn-rwx-segments &>/dev/null; echo $?) -eq 0 ]; then ^-- SC2046 (warning): Quote this to prevent word splitting. ^------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^---------^ SC3020 (warning): In POSIX sh, &> is undefined. Fixes: b2accfe7ca5b ("powerpc/boot: Check for ld-option support") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250423082154.30625-1-maddy@linux.ibm.com
2025-04-23Fix mis-uses of 'cc-option' for warning disablementLinus Torvalds
This was triggered by one of my mis-uses causing odd build warnings on sparc in linux-next, but while figuring out why the "obviously correct" use of cc-option caused such odd breakage, I found eight other cases of the same thing in the tree. The root cause is that 'cc-option' doesn't work for checking negative warning options (ie things like '-Wno-stringop-overflow') because gcc will silently accept options it doesn't recognize, and so 'cc-option' ends up thinking they are perfectly fine. And it all works, until you have a situation where _another_ warning is emitted. At that point the compiler will go "Hmm, maybe the user intended to disable this warning but used that wrong option that I didn't recognize", and generate a warning for the unrecognized negative option. Which explains why we have several cases of this in the tree: the 'cc-option' test really doesn't work for this situation, but most of the time it simply doesn't matter that ity doesn't work. The reason my recently added case caused problems on sparc was pointed out by Thomas Weißschuh: the sparc build had a previous explicit warning that then triggered the new one. I think the best fix for this would be to make 'cc-option' a bit smarter about this sitation, possibly by adding an intentional warning to the test case that then triggers the unrecognized option warning reliably. But the short-term fix is to replace 'cc-option' with an existing helper designed for this exact case: 'cc-disable-warning', which picks the negative warning but uses the positive form for testing the compiler support. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/all/20250422204718.0b4e3f81@canb.auug.org.au/ Explained-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-23x86/mm: Fix _pgd_alloc() for Xen PV modeJuergen Gross
Recently _pgd_alloc() was switched from using __get_free_pages() to pagetable_alloc_noprof(), which might return a compound page in case the allocation order is larger than 0. On x86 this will be the case if CONFIG_MITIGATION_PAGE_TABLE_ISOLATION is set, even if PTI has been disabled at runtime. When running as a Xen PV guest (this will always disable PTI), using a compound page for a PGD will result in VM_BUG_ON_PGFLAGS being triggered when the Xen code tries to pin the PGD. Fix the Xen issue together with the not needed 8k allocation for a PGD with PTI disabled by replacing PGD_ALLOCATION_ORDER with an inline helper returning the needed order for PGD allocations. Fixes: a9b3c355c2e6 ("asm-generic: pgalloc: provide generic __pgd_{alloc,free}") Reported-by: Petr Vaněk <arkamar@atlas.cz> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Petr Vaněk <arkamar@atlas.cz> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250422131717.25724-1-jgross%40suse.com
2025-04-23arm64: dts: mediatek: mt8395-nio-12l: Enable Audio DSP and sound cardJulien Massot
Add memory regions for the Audio DSP (ADSP) and Audio Front-End (AFE), and enable both components in the device tree. Also, define the required pin configuration and add a sound card node configured to use the ADSP. This enables audio output through the 3.5mm headphone jack available on the board. Signed-off-by: Julien Massot <julien.massot@collabora.com> Link: https://lore.kernel.org/r/20250423-mt8395-audio-sof-v2-1-5e6dc7fba0fc@collabora.com Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>