Age  Commit message  Author
2024-09-09  KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot  Sean Christopherson
Add a dedicated helper to walk and zap rmaps for a given memslot so that the code can be shared between KVM-initiated zaps and mmu_notifier invalidations. No functional change intended. Link: https://lore.kernel.org/r/20240809194335.1726916-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()  Sean Christopherson
Add a @can_yield param to __walk_slot_rmaps() to control whether or not dropping mmu_lock and conditionally rescheduling is allowed. This will allow using __walk_slot_rmaps() and thus cond_resched() to handle mmu_notifier invalidations, which usually allow blocking/yielding, but not when invoked by the OOM killer. Link: https://lore.kernel.org/r/20240809194335.1726916-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()  Sean Christopherson
Move walk_slot_rmaps() and friends up near for_each_slot_rmap_range() so that the walkers can be used to handle mmu_notifier invalidations, and so that similar functionality has some amount of locality in the code. No functional change intended. Link: https://lore.kernel.org/r/20240809194335.1726916-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn  Sean Christopherson
WARN if KVM gets an MMIO cache hit on a RET_PF_WRITE_PROTECTED fault, as KVM should return RET_PF_WRITE_PROTECTED if and only if there is a memslot, and creating a memslot is supposed to invalidate the MMIO cache by virtue of changing the memslot generation. Keep the code around mainly to provide a convenient location to document why emulated MMIO should be impossible. Suggested-by: Yuan Yao <yuan.yao@linux.intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list  Sean Christopherson
Explicitly query the list of to-be-zapped shadow pages when checking to see if unprotecting a gfn for retry has succeeded, i.e. if KVM should retry the faulting instruction. Add a comment to explain why the list needs to be checked before zapping, which is the primary motivation for this change. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version  Sean Christopherson
Fold kvm_mmu_unprotect_page() into kvm_mmu_unprotect_gfn_and_retry() now that all other direct usage is gone. No functional change intended. Link: https://lore.kernel.org/r/20240831001538.336683-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()  Sean Christopherson
Rename reexecute_instruction() to kvm_unprotect_and_retry_on_failure() to make the intent and purpose of the helper much more obvious. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Update retry protection fields when forcing retry on emulation failure  Sean Christopherson
When retrying the faulting instruction after emulation failure, refresh the infinite loop protection fields even if no shadow pages were zapped, i.e. avoid hitting an infinite loop even when retrying the instruction as a last-ditch effort to avoid terminating the guest. Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Apply retry protection to "unprotect on failure" path  Sean Christopherson
Use kvm_mmu_unprotect_gfn_and_retry() in reexecute_instruction() to pick up protection against infinite loops, e.g. if KVM somehow manages to encounter an unsupported instruction and unprotecting the gfn doesn't allow the vCPU to make forward progress. Other than that, the retry-on-failure logic is a functionally equivalent, open coded version of kvm_mmu_unprotect_gfn_and_retry(). Note, the emulation failure path still isn't fully protected, as KVM won't update the retry protection fields if no shadow pages are zapped (but this change is still a step forward). That flaw will be addressed in a future patch. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn  Sean Christopherson
Don't bother unprotecting the target gfn if EMULTYPE_WRITE_PF_TO_SP is set, as KVM will simply report the emulation failure to userspace. This will allow converting reexecute_instruction() to use kvm_mmu_unprotect_gfn_and_retry() instead of kvm_mmu_unprotect_page(). Link: https://lore.kernel.org/r/20240831001538.336683-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation  Sean Christopherson
Drop the manual pfn lookup when retrying an instruction that KVM failed to emulate in response to a #PF due to a write-protected gfn. Now that KVM sets EMULTYPE_ALLOW_RETRY_PF if and only if the page fault hit a write-protected gfn, i.e. if and only if there's a writable memslot, there's no need to redo the lookup to avoid retrying an instruction that failed on emulated MMIO (no slot, or a write to a read-only slot). I.e. KVM will never attempt to retry an instruction that failed on emulated MMIO, whereas that was not the case prior to the introduction of RET_PF_WRITE_PROTECTED. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Move event re-injection unprotect+retry into common path  Sean Christopherson
Move the event re-injection unprotect+retry logic into kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if the #PF actually hit a write-protected gfn. Note, there is a small possibility that the gfn was unprotected by a different task between hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume the guest immediately anyways because KVM will treat the fault as spurious. As a bonus, unprotecting _after_ handling the page fault also addresses the case where installing a SPTE to handle the fault encounters a shadowed PTE, i.e. *creates* a read-only SPTE. Opportunistically add a comment explaining what on earth the intent of the code is, based on the changelog from commit 577bdc496614 ("KVM: Avoid instruction emulation when event delivery is pending"). Link: https://lore.kernel.org/r/20240831001538.336683-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting  Sean Christopherson
When getting a gpa from a gva to unprotect the associated gfn when an event is awaiting reinjection, walk the guest PTEs for WRITE as there's no point in unprotecting the gfn if the guest is unable to write the page, i.e. if write-protection can't trigger emulation. Note, the entire flow should be guarded on the access being a write, and even better should be conditioned on actually triggering a write-protect fault. This will be addressed in a future commit. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Don't try to unprotect an INVALID_GPA  Sean Christopherson
If getting the gpa for a gva fails, e.g. because the gva isn't mapped in the guest page tables, don't try to unprotect the invalid gfn. This is mostly a performance fix (avoids unnecessarily taking mmu_lock), as for_each_gfn_valid_sp_with_gptes() won't explode on garbage input, it's simply pointless. Link: https://lore.kernel.org/r/20240831001538.336683-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Fold retry_instruction() into x86_emulate_instruction()  Sean Christopherson
Now that retry_instruction() is reasonably tiny, fold it into its sole caller, x86_emulate_instruction(). In addition to getting rid of the absurdly confusing retry_instruction() name, handling the retry in x86_emulate_instruction() pairs it back up with the code that resets last_retry_{eip,address}. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()  Sean Christopherson
Move the sanity checks for EMULTYPE_ALLOW_RETRY_PF to the top of x86_emulate_instruction(). In addition to deduplicating a small amount of code, this makes the connection between EMULTYPE_ALLOW_RETRY_PF and EMULTYPE_PF even more explicit, and will allow dropping retry_instruction() entirely. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs  Sean Christopherson
Try to unprotect shadow pages if and only if indirect_shadow_pages is non-zero, i.e. iff there is at least one such protected shadow page. Pre-checking indirect_shadow_pages avoids taking mmu_lock for write when the gfn is write-protected by a third party, i.e. not for KVM shadow paging, and in the *extremely* unlikely case that a different task has already unprotected the last shadow page. Link: https://lore.kernel.org/r/20240831001538.336683-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path  Sean Christopherson
Move the anti-infinite-loop protection provided by last_retry_{eip,addr} into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that never hits the emulator, as well as reexecute_instruction(), which is the last ditch "might as well try it" logic that kicks in when emulation fails on an instruction that faulted on a write-protected gfn. Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry fields and deduplicate other code (with more to come). Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
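The anti-infinite-loop protection described in this entry can be sketched roughly as follows. This is a minimal user-space model, not the kernel's code: `struct retry_state` and `should_retry()` are hypothetical names, and only the idea of remembering the last retried RIP and address comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU last_retry_{eip,addr} fields. */
struct retry_state {
	uint64_t last_retry_eip;
	uint64_t last_retry_addr;
};

/*
 * Return true if resuming the guest is worth trying, i.e. if this exact
 * (rip, addr) pair is not the pair that was retried most recently.
 * A repeat means the previous retry made no forward progress.
 */
bool should_retry(struct retry_state *rs, uint64_t rip, uint64_t addr)
{
	assert(rs);

	if (rs->last_retry_eip == rip && rs->last_retry_addr == addr)
		return false;

	/* Remember this attempt so that an immediate repeat is refused. */
	rs->last_retry_eip = rip;
	rs->last_retry_addr = addr;
	return true;
}
```

In this model a second fault at the same instruction and address is refused, so the caller would fall back to emulation instead of resuming the guest again.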
2024-09-09  KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry  Sean Christopherson
Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not an unsigned long. This fixes a bug where 32-bit KVM would unprotect and retry the wrong gfn if the gpa had bits 63:32!=0. In practice, this bug is functionally benign, as unprotecting the wrong gfn is purely a performance issue (thanks to the anti-infinite-loop logic). And of course, almost no one runs 32-bit KVM these days. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
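The truncation described in this entry can be reproduced in a small stand-alone sketch. The type `ulong32_t` is an assumption used to model a 32-bit kernel's `unsigned long`, and the helper names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

typedef uint64_t gpa_t;		/* guest physical address, 64 bits wide */
typedef uint64_t gfn_t;		/* guest frame number */
typedef uint32_t ulong32_t;	/* models unsigned long on a 32-bit kernel */

static_assert(sizeof(gpa_t) == 8, "gpa_t must hold a full 64-bit address");

/* Correct conversion: the gpa keeps all 64 bits. */
gfn_t gfn_from_gpa(gpa_t gpa)
{
	return gpa >> PAGE_SHIFT;
}

/* Buggy conversion: the gpa passes through a 32-bit variable first. */
gfn_t gfn_via_ulong32(gpa_t gpa)
{
	ulong32_t truncated = (ulong32_t)gpa;	/* bits 63:32 are dropped */

	return gfn_from_gpa(truncated);
}
```

For a gpa below 4 GiB both helpers agree; above 4 GiB the second one yields a different, wrong gfn.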
2024-09-09  KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip  Sean Christopherson
Read RIP from vCPU state instead of pulling it from the emulation context when filling last_retry_eip, which is part of the anti-infinite-loop protection used when unprotecting and retrying instructions that hit a write-protected gfn. This will allow reusing the anti-infinite-loop protection in flows that never make it into the emulator. No functional change intended, as ctxt->eip is set to kvm_rip_read() in init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet). Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped  Sean Christopherson
Resume the guest and thus skip emulation of a non-PTE-writing instruction if and only if unprotecting the gfn actually zapped at least one shadow page. If the gfn is write-protected for some reason other than shadow paging, attempting to unprotect the gfn will effectively fail, and thus retrying the instruction is all but guaranteed to be pointless. This bug has existed for a long time, but was effectively fudged around by the retry RIP+address anti-loop detection. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected  Sean Christopherson
When doing "fast unprotection" of nested TDP page tables, skip emulation if and only if at least one gfn was unprotected, i.e. continue with emulation if simply resuming is likely to hit the same fault and risk putting the vCPU into an infinite loop. Note, it's entirely possible to get a false negative, e.g. if a different vCPU faults on the same gfn and unprotects the gfn first, but that's a relatively rare edge case, and emulating is still functionally ok, i.e. saving a few cycles by avoiding emulation isn't worth the risk of putting the vCPU into an infinite loop. Opportunistically rewrite the relevant comment to document in gory detail exactly what scenario the "fast unprotect" logic is handling. Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes") Cc: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults  Sean Christopherson
Trigger KVM's various "unprotect gfn" paths if and only if the page fault was a write to a write-protected gfn. To do so, add a new page fault return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track such page faults. If a page fault requires emulation for any MMIO (or any reason besides write-protection), trying to unprotect the gfn is pointless and risks putting the vCPU into an infinite loop. E.g. KVM will put the vCPU into an infinite loop if the vCPU manages to trigger MMIO on a page table walk. Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes") Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper  Sean Christopherson
Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a more appropriately named is_write_to_guest_page_table(). The macro name is misleading, because while all nNPT walks match PAGE|WRITE|PRESENT, the reverse is not true. No functional change intended. Link: https://lore.kernel.org/r/20240831001538.336683-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: Harden guest memory APIs against out-of-bounds accesses  Sean Christopherson
When reading or writing a guest page, WARN and bail if offset+len would result in a read to a different page so that KVM bugs are more likely to be detected, and so that any such bugs are less likely to escalate to an out-of-bounds access. E.g. if userspace isn't using guard pages and the target page is at the end of a memslot. Note, KVM already hardens itself in similar APIs, e.g. in the "cached" variants, it's just the vanilla APIs that are playing with fire. Link: https://lore.kernel.org/r/20240829191413.900740-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
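The "offset+len must stay within the page" condition from this entry can be sketched as a stand-alone predicate. The function name and the fixed 4 KiB page size are assumptions for illustration; the kernel additionally WARNs before bailing, which this model omits.

```c
#include <stdbool.h>

#define PAGE_SIZE 4096u

/*
 * Return true if the byte range [offset, offset + len) lies entirely
 * inside one page. The comparison is arranged so that offset + len is
 * never computed and therefore cannot wrap around.
 */
bool access_within_page(unsigned int offset, unsigned int len)
{
	return offset <= PAGE_SIZE && len <= PAGE_SIZE - offset;
}
```

A caller in this model would refuse the access (and report a bug) whenever the predicate returns false, rather than touching the neighboring page.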
2024-09-09  KVM: Write the per-page "segment" when clearing (part of) a guest page  Sean Christopherson
Pass "seg" instead of "len" when writing guest memory in kvm_clear_guest(), as "seg" holds the number of bytes to write for the current page, while "len" holds the total bytes remaining. Luckily, all users of kvm_clear_guest() are guaranteed to not cross a page boundary, and so the bug is unhittable in the current code base. Fixes: 2f5414423ef5 ("KVM: remove kvm_clear_guest_page") Reported-by: zyr_ms@outlook.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219104 Link: https://lore.kernel.org/r/20240829191413.900740-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
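The difference between "seg" and "len" is easier to see in a small model of a page-by-page clear. Everything here is a sketch under stated assumptions: guest memory is modeled as an array of 4 KiB pages, and `clear_page_range()` and `clear_guest_model()` are hypothetical names rather than KVM functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Zero n bytes of a single page, starting at offset within that page. */
void clear_page_range(unsigned char (*pages)[PAGE_SIZE], size_t gfn,
		      size_t offset, size_t n)
{
	assert(offset <= PAGE_SIZE && n <= PAGE_SIZE - offset);
	memset(&pages[gfn][offset], 0, n);
}

/* Zero len bytes starting at gpa, one page-sized segment at a time. */
void clear_guest_model(unsigned char (*pages)[PAGE_SIZE], size_t gpa,
		       size_t len)
{
	size_t gfn = gpa / PAGE_SIZE;
	size_t offset = gpa % PAGE_SIZE;

	while (len > 0) {
		size_t room = PAGE_SIZE - offset;
		/* seg: bytes for the current page; len: total bytes left. */
		size_t seg = len < room ? len : room;

		/* Passing len here instead of seg could overrun the page. */
		clear_page_range(pages, gfn, offset, seg);

		len -= seg;
		offset = 0;
		gfn++;
	}
}
```

Clearing 200 bytes at gpa 4000 in this model touches the last 96 bytes of page 0 and the first 104 bytes of page 1, which is exactly the split that "seg" encodes.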
2024-09-09  KVM: nVMX: Assert that vcpu->mutex is held when accessing secondary VMCSes  Sean Christopherson
Add lockdep assertions in get_vmcs12() and get_shadow_vmcs12() to verify the vCPU's mutex is held, as the returned VMCS objects are dynamically allocated/freed when nested VMX is turned on/off, i.e. accessing vmcs12 structures without holding vcpu->mutex is susceptible to use-after-free. Waive the assertion if the VM is being destroyed, as KVM currently forces a nested VM-Exit when freeing the vCPU. If/when that wart is fixed, the assertion can/should be converted to an unqualified lockdep assertion. See also https://lore.kernel.org/all/Zsd0TqCeY3B5Sb5b@google.com. Link: https://lore.kernel.org/r/20240906043413.1049633-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at VM-Enter  Sean Christopherson
Explicitly invalidate posted_intr_nv when emulating nested VM-Enter and posted interrupts are disabled to make it clear that posted_intr_nv is valid if and only if nested posted interrupts are enabled, and as a cheap way to harden against KVM bugs. KVM initializes posted_intr_nv to -1 at vCPU creation and resets it to -1 when unloading vmcs12 and/or leaving nested mode, i.e. this is not a bug fix (or at least, it's not intended to be a bug fix). Note, tracking nested.posted_intr_nv as a u16 subtly adds a measure of safety, as it prevents unintentionally matching KVM's informal "no IRQ" vector of -1, stored as a signed int. Because a u16 can always be represented as a signed int, the effective "invalid" value of posted_intr_nv, 65535, will be preserved as-is when comparing against an int, i.e. will be zero-extended, not sign-extended, and thus won't get a false positive if KVM is buggy and compares posted_intr_nv against -1. Opportunistically add a comment in vmx_deliver_nested_posted_interrupt() to call out that it must check vmx->nested.posted_intr_nv, not the vector in vmcs12, which is presumably the _entire_ reason nested.posted_intr_nv exists. E.g. vmcs12 is a KVM-controlled snapshot, so there are no TOCTOU races to worry about, the only potential badness is if the vCPU leaves nested and frees vmcs12 between the sender checking is_guest_mode() and dereferencing the vmcs12 pointer. Link: https://lore.kernel.org/r/20240906043413.1049633-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
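The zero-extension argument in this entry follows from C's integer promotion rules and can be checked with a short sketch. The macro `NO_IRQ` and both helper names are hypothetical; only the u16-versus-int comparison itself is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_IRQ (-1)	/* hypothetical name for the informal "no IRQ" vector */

static_assert(sizeof(int) > sizeof(uint16_t),
	      "the argument assumes int is wider than 16 bits");

/*
 * A uint16_t operand is promoted to int by zero-extension, so an
 * "invalid" value stored as (uint16_t)-1 compares as 65535, not -1.
 */
bool u16_nv_matches(uint16_t posted_intr_nv, int vector)
{
	return posted_intr_nv == vector;
}

/* With a signed int field, -1 would match the "no IRQ" vector. */
bool int_nv_matches(int posted_intr_nv, int vector)
{
	return posted_intr_nv == vector;
}
```

With the field typed as a 16-bit unsigned value, a buggy comparison against the "no IRQ" vector cannot succeed by accident, whereas the signed variant would match.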
2024-09-09  KVM: x86: Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt()  Sean Christopherson
Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt() now that nVMX essentially open codes kvm_get_apic_interrupt() in order to correctly emulate nested posted interrupts. Opportunistically stop exporting kvm_cpu_get_interrupt(), as the aforementioned nVMX flow was the only user in vendor code. Link: https://lore.kernel.org/r/20240906043413.1049633-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection  Sean Christopherson
When synthesizing a nested VM-Exit due to an external interrupt, pend a nested posted interrupt if the external interrupt vector matches L2's PI notification vector, i.e. if the interrupt is a PI notification for L2. This fixes a bug where KVM will incorrectly inject VM-Exit instead of processing the nested posted interrupt when IPI virtualization is enabled.

Per the SDM, detection of the notification vector doesn't occur until the interrupt is acknowledged and delivered to the CPU core.

  If the external-interrupt exiting VM-execution control is 1, any unmasked external interrupt causes a VM exit (see Section 26.2). If the "process posted interrupts" VM-execution control is also 1, this behavior is changed and the processor handles an external interrupt as follows:

  1. The local APIC is acknowledged; this provides the processor core with an interrupt vector, called here the physical vector.
  2. If the physical vector equals the posted-interrupt notification vector, the logical processor continues to the next step. Otherwise, a VM exit occurs as it would normally due to an external interrupt; the vector is saved in the VM-exit interruption-information field.

For the most part, KVM has avoided problems because a PI NV for L2 that arrives while L2 is active will be processed by hardware, and KVM checks for a pending notification vector during nested VM-Enter. Thus, to hit the bug, the PI NV interrupt needs to sneak its way into L1's vIRR while L2 is active.

Without IPI virtualization, the scenario is practically impossible to hit, modulo L1 doing weird things (see below), as the ordering between vmx_deliver_posted_interrupt() and nested VM-Enter effectively guarantees that either the sender will see the vCPU as being in_guest_mode(), or the receiver will see the interrupt in its vIRR.

With IPI virtualization, introduced by commit d588bb9be1da ("KVM: VMX: enable IPI virtualization"), the sending CPU effectively implements a rough equivalent of vmx_deliver_posted_interrupt(), sans the nested PI NV check. If the target vCPU has a valid PID, the CPU will send a PI NV interrupt based on _L1's_ PID, because the sender's IPIv table points at L1 PIDs.

  PIR := 32 bytes at PID_ADDR;
  // under lock
  PIR[V] := 1;
  store PIR at PID_ADDR;
  // release lock

  NotifyInfo := 8 bytes at PID_ADDR + 32;
  // under lock
  IF NotifyInfo.ON = 0 AND NotifyInfo.SN = 0; THEN
    NotifyInfo.ON := 1;
    SendNotify := 1;
  ELSE
    SendNotify := 0;
  FI;
  store NotifyInfo at PID_ADDR + 32;
  // release lock

  IF SendNotify = 1; THEN
    send an IPI specified by NotifyInfo.NDST and NotifyInfo.NV;
  FI;

As a result, the target vCPU ends up receiving an interrupt on KVM's POSTED_INTR_VECTOR while L2 is running, with an interrupt in L1's PIR for L2's nested PI NV. The POSTED_INTR_VECTOR interrupt triggers a VM-Exit from L2 to L0, KVM moves the interrupt from L1's PIR to vIRR, triggers a KVM_REQ_EVENT prior to re-entry to L2, and calls vmx_check_nested_events(), effectively bypassing all of KVM's "early" checks on nested PI NV.

Without IPI virtualization, the bug can likely be hit only if L1 programs an assigned device to _post_ an interrupt to L2's notification vector, by way of L1's PID.PIR. Doing so would allow the interrupt to get into L1's vIRR without KVM checking vmcs12's NV. Which is architecturally allowed, but unlikely behavior for a hypervisor.

Cc: Zeng Guang <guang.zeng@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20240906043413.1049633-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
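The posting pseudocode quoted in this entry can be approximated in C as below. This is a deliberately simplified, single-threaded sketch: `struct pid_model` and `post_interrupt()` are hypothetical, the ON/SN/NV/NDST fields are stored as separate members rather than packed into 8 bytes, and the locked (atomic) updates performed by real hardware are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified posted-interrupt descriptor; layout is an assumption. */
struct pid_model {
	uint32_t pir[8];	/* 256-bit posted-interrupt request bitmap */
	bool on;		/* outstanding notification */
	bool sn;		/* suppress notification */
	uint8_t nv;		/* notification vector */
	uint32_t ndst;		/* notification destination */
};

/*
 * Post vector into the descriptor. Returns true if the caller should
 * send a notification IPI to pid->ndst using vector pid->nv.
 */
bool post_interrupt(struct pid_model *pid, uint8_t vector)
{
	assert(pid);

	/* PIR[V] := 1 */
	pid->pir[vector / 32] |= UINT32_C(1) << (vector % 32);

	/* Notify only if no notification is outstanding or suppressed. */
	if (!pid->on && !pid->sn) {
		pid->on = true;
		return true;
	}
	return false;
}
```

Note that nothing in this flow inspects which guest level the posted vector is meant for, which matches the entry's point that the sender bypasses the nested PI NV check.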
2024-09-09  KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQ  Sean Christopherson
In the should-be-impossible scenario that kvm_cpu_get_interrupt() doesn't return a valid vector after checking kvm_cpu_has_interrupt(), skip VM-Exit injection to reduce the probability of crashing/confusing L1. Now that KVM gets the IRQ _before_ calling nested_vmx_vmexit(), squashing the VM-Exit injection is trivial since there are no actions that need to be undone. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20240906043413.1049633-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site  Sean Christopherson
Move the logic to get the to-be-acknowledged IRQ for a nested VM-Exit from nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one and only path where KVM invokes nested_vmx_vmexit() with EXIT_REASON_EXTERNAL_INTERRUPT. A future fix will perform a last-minute check on L2's nested posted interrupt notification vector, just before injecting a nested VM-Exit. To handle that scenario correctly, KVM needs to get the interrupt _before_ injecting VM-Exit, as simply querying the highest priority interrupt, via kvm_cpu_has_interrupt(), would result in a TOCTOU bug, as a new, higher priority interrupt could arrive between kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt(). Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in kvm_get_apic_interrupt(), and acknowledging the interrupt before nested VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01. Open code a rough equivalent to kvm_cpu_get_interrupt() so that the IRQ is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU issue described above. Opportunistically convert the WARN_ON() to a WARN_ON_ONCE(). If KVM has a bug that results in a false positive from kvm_cpu_has_interrupt(), spamming dmesg won't help the situation. Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as they are always "wanted" by L0. Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Move "ack" phase of local APIC IRQ delivery to separate API  Sean Christopherson
Split the "ack" phase, i.e. the movement of an interrupt from IRR=>ISR, out of kvm_get_apic_interrupt() and into a separate API so that nested VMX can acknowledge a specific interrupt _after_ emulating a VM-Exit from L2 to L1. To correctly emulate nested posted interrupts while APICv is active, KVM must: 1. find the highest pending interrupt. 2. check if that IRQ is L2's notification vector 3. emulate VM-Exit if the IRQ is NOT the notification vector 4. ACK the IRQ in L1 _after_ VM-Exit When APICv is active, the process of moving the IRQ from the IRR to the ISR also requires a VMWRITE to update vmcs01.GUEST_INTERRUPT_STATUS.SVI, and so acknowledging the interrupt before switching to vmcs01 would result in marking the IRQ as in-service in the wrong VMCS. KVM currently fudges around this issue by doing kvm_get_apic_interrupt() smack dab in the middle of emulating VM-Exit, but that hack doesn't play nice with nested posted interrupts, as notification vector IRQs don't trigger a VM-Exit in the first place. Cc: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240906043413.1049633-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
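The four steps listed in this entry can be sketched against a toy APIC model. All names here (`struct apic_model`, `find_highest_irr()`, `ack_irq()`, `handle_irq_for_l2()`) are hypothetical, the VMCS switch is reduced to clearing a flag, and clearing the IRR bit when the vector is L2's notification vector is an assumption of the model rather than something this entry states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy local APIC state: 256-bit IRR and ISR bitmaps. */
struct apic_model {
	uint32_t irr[8];
	uint32_t isr[8];
};

enum irq_action { IRQ_NONE, IRQ_PEND_NESTED_PI, IRQ_NESTED_VMEXIT };

/* Step 1: return the highest pending vector, or -1 if the IRR is empty. */
int find_highest_irr(const struct apic_model *apic)
{
	for (int word = 7; word >= 0; word--) {
		uint32_t bits = apic->irr[word];

		if (!bits)
			continue;
		for (int bit = 31; bit >= 0; bit--) {
			if (bits & (UINT32_C(1) << bit))
				return word * 32 + bit;
		}
	}
	return -1;
}

/* Step 4: acknowledge one specific vector by moving it from IRR to ISR. */
void ack_irq(struct apic_model *apic, int vector)
{
	uint32_t mask;

	assert(vector >= 0 && vector < 256);
	mask = UINT32_C(1) << (vector % 32);
	apic->irr[vector / 32] &= ~mask;
	apic->isr[vector / 32] |= mask;
}

enum irq_action handle_irq_for_l2(struct apic_model *apic, int l2_nv,
				  bool *guest_mode)
{
	int irq = find_highest_irr(apic);	/* step 1 */

	if (irq < 0)
		return IRQ_NONE;

	if (irq == l2_nv) {			/* step 2 */
		/* Model assumption: consume the IRR bit, skip the VM-Exit. */
		apic->irr[irq / 32] &= ~(UINT32_C(1) << (irq % 32));
		return IRQ_PEND_NESTED_PI;
	}

	*guest_mode = false;			/* step 3: emulate VM-Exit */
	ack_irq(apic, irq);			/* step 4: ack that same IRQ */
	return IRQ_NESTED_VMEXIT;
}
```

Because the vector is fetched once and that same value is acknowledged after the emulated exit, the model never re-queries the IRR between the check and the acknowledgment.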
2024-09-09  KVM: VMX: Also clear SGX EDECCSSA in KVM CPU caps when SGX is disabled  Kai Huang
When SGX EDECCSSA support was added to KVM in commit 16a7fe3728a8 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest"), it forgot to clear the X86_FEATURE_SGX_EDECCSSA bit in KVM CPU caps when KVM SGX is disabled. Fix it. Fixes: 16a7fe3728a8 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240905120837.579102-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Remove some unused declarations  Yue Haibing
Commit 238adc77051a ("KVM: Cleanup LAPIC interface") removed kvm_lapic_get_base() but left its declaration behind. Two other declarations have never had an implementation since they were introduced. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20240830022537.2403873-1-yuehaibing@huawei.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: selftests: Verify single-stepping a fastpath VM-Exit exits to userspace  Sean Christopherson
In x86's debug_regs test, change the RDMSR(MISC_ENABLES) in the single-step testcase to a WRMSR(TSC_DEADLINE) in order to verify that KVM honors KVM_GUESTDBG_SINGLESTEP when handling a fastpath VM-Exit. Note, the extra coverage is effectively Intel-only, as KVM only handles TSC_DEADLINE in the fastpath when the timer is emulated via the hypervisor timer, a.k.a. the VMX preemption timer. Link: https://lore.kernel.org/r/20240830044448.130449-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Forcibly leave nested if RSM to L2 hits shutdown  Sean Christopherson
Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if RSM fails when resuming L2 (a.k.a. guest mode). Architecturally, shutdown on RSM occurs _before_ the transition back to guest mode on both Intel and AMD.

On Intel, per the SDM pseudocode, SMRAM state is loaded before critical VMX state:

  restore state normally from SMRAM;
  ...
  CR4.VMXE := value stored internally;
  IF internal storage indicates that the logical processor had been in
     VMX operation (root or non-root)
  THEN
     enter VMX operation (root or non-root);
     restore VMX-critical state as defined in Section 32.14.1;
     ...
     restore current VMCS pointer;
  FI;

AMD's APM is both less clearcut and more explicit. Because AMD CPUs save VMCB and guest state in SMRAM itself, given the lack of anything in the APM to indicate a shutdown in guest mode is possible, a straightforward reading of the clause on invalid state is that _what_ state is invalid is irrelevant, i.e. all roads lead to shutdown.

  An RSM causes a processor shutdown if an invalid-state condition is found in the SMRAM state-save area.

This fixes a bug found by syzkaller where synthesizing shutdown for L2 led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn caused KVM to complain about trying to cancel a nested VM-Enter (see commit 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")).

Note, Paolo pointed out that KVM shouldn't set nested_run_pending until after loading SMRAM state. But as above, that's only half the story, KVM shouldn't transition to guest mode either. Unfortunately, fixing that mess requires rewriting the nVMX and nSVM RSM flows to not piggyback their nested VM-Enter flows, as executing the nested VM-Enter flows after loading state from SMRAM would clobber much of said state. For now, add a FIXME to call out that transitioning to guest mode before loading state from SMRAM is wrong.

Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@mail.gmail.com Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com Reported-by: Zheyu Ma <zheyuma97@gmail.com> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com Analyzed-by: Michal Wilczynski <michal.wilczynski@intel.com> Cc: Kishen Maloor <kishen.maloor@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906161337.1118412-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-08  Linux 6.11-rc7  Linus Torvalds
2024-09-08  Merge tag 'timers_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull timer fixes from Borislav Petkov: - Remove percpu irq related code in the timer-of initialization routine as it is broken but also unused (Daniel Lezcano) - Fix return -ETIME when delta exceeds INT_MAX and the next event not taking effect sometimes (Jacky Bai) * tag 'timers_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/imx-tpm: Fix next event not taking effect sometime clocksource/drivers/imx-tpm: Fix return -ETIME when delta exceeds INT_MAX clocksource/drivers/timer-of: Remove percpu irq related code
2024-09-08Merge tag 'perf_urgent_for_v6.11_rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Fix perf's AUX buffer serialization - Prevent uninitialized struct members in perf's uprobes handling * tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/aux: Fix AUX buffer serialization uprobes: Use kzalloc to allocate xol area
2024-09-08Merge tag 'char-misc-6.11-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char/misc/other driver fixes for 6.11-rc7. It's nothing huge, just a bunch of small fixes of reported problems, including: - lots of tiny iio driver fixes - nvmem driver fixes - binder UAF bugfix - uio driver crash fix - other small fixes All of these have been in linux-next this week with no reported problems" * tag 'char-misc-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits) VMCI: Fix use-after-free when removing resource in vmci_resource_remove() Drivers: hv: vmbus: Fix rescind handling in uio_hv_generic uio_hv_generic: Fix kernel NULL pointer dereference in hv_uio_rescind misc: keba: Fix sysfs group creation dt-bindings: nvmem: Use soc-nvmem node name instead of nvmem nvmem: Fix return type of devm_nvmem_device_get() in kerneldoc nvmem: u-boot-env: error if NVMEM device is too small misc: fastrpc: Fix double free of 'buf' in error path binder: fix UAF caused by offsets overwrite iio: imu: inv_mpu6050: fix interrupt status read for old buggy chips iio: adc: ad7173: fix GPIO device info iio: adc: ad7124: fix DT configuration parsing iio: adc: ad_sigma_delta: fix irq_flags on irq request iio: adc: ads1119: Fix IRQ flags iio: fix scale application in iio_convert_raw_to_processed_unlocked iio: adc: ad7124: fix config comparison iio: adc: ad7124: fix chip ID mismatch iio: adc: ad7173: Fix incorrect compatible string iio: buffer-dmaengine: fix releasing dma channel on error iio: adc: ad7606: remove frstdata check for serial mode ...
2024-09-08Merge tag 'usb-6.11-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a handful of small USB fixes for 6.11-rc7. Included in here are: - dwc3 driver fixes for two reported problems - two typec ucsi driver fixes - cdns2 controller reset fix All of these have been in linux-next this week with no reported problems" * tag 'usb-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: typec: ucsi: Fix cable registration usb: typec: ucsi: Fix the partner PD revision usb: cdns2: Fix controller reset issue usb: dwc3: core: update LC timer as per USB Spec V3.2 usb: dwc3: Avoid waking up gadget during startxfer
2024-09-07Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A pile of Qualcomm clk driver fixes with two main themes: the alpha PLL driver and shared RCGs, and one fix for the Starfive JH7110 SoC. - The Alpha PLL clk_ops had multiple problems around setting rates. There are a handful of patches here that fix masks and skip enabling the clk from set_rate() when the PLL is disabled. The PLLs are crucial to operation of the system as almost all frequencies in the system are derived from them. - Parking shared RCGs at a slow always on clk at registration time breaks stuff. USB host mode can't handle such a slow frequency and the serial console gets all garbled when the UART clk is handed over to the kernel. There's a few patches that don't use the shared clk_ops for the UART clks and another one to skip parking the USB clk at registration time. - The Starfive PLL driver used for the CPU was busted causing cpufreq to fail because the clk didn't change to a safe parent during set_rate(). The fix is to register a notifier and switch to a safe parent so the PLL can change rate in a glitch free manner" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: qcom: gcc-sc8280xp: don't use parking clk_ops for QUPs clk: starfive: jh7110-sys: Add notifier for PLL0 clock clk: qcom: gcc-sm8650: Don't use shared clk_ops for QUPs clk: qcom: gcc-sm8550: Don't park the USB RCG at registration time clk: qcom: gcc-sm8550: Don't use parking clk_ops for QUPs clk: qcom: gcc-x1e80100: Don't use parking clk_ops for QUPs clk: qcom: ipq9574: Update the alpha PLL type for GPLLs clk: qcom: gcc-x1e80100: Fix USB 0 and 1 PHY GDSC pwrsts flags clk: qcom: clk-alpha-pll: Update set_rate for Zonda PLL clk: qcom: clk-alpha-pll: Fix zonda set_rate failure when PLL is disabled clk: qcom: clk-alpha-pll: Fix the trion pll postdiv set rate API clk: qcom: clk-alpha-pll: Fix the pll post div mask
2024-09-07Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "Single ufs driver fix quirking around another device spec violation" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ufs: ufs-mediatek: Add UFSHCD_QUIRK_BROKEN_LSDBS_CAP
2024-09-07Merge tag 'pinctrl-v6.11-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fix from Linus Walleij: "A single fix for Qualcomm laptops that are affected by missing wakeup IRQs" * tag 'pinctrl-v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: qcom: x1e80100: Bypass PDC wakeup parent for now
2024-09-06Merge tag 'linux_kselftest-kunit-fixes-6.11-rc7-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit fix from Shuah Khan: "Fix to a missing function parameter warning found during documentation build in linux-next" * tag 'linux_kselftest-kunit-fixes-6.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: Fix missing kerneldoc comment
2024-09-06Merge tag 'pci-v6.11-fixes-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci fixes from Bjorn Helgaas: - Unregister platform devices for child nodes when stopping a PCI device, even if the PCI core has already cleared the OF_POPULATED bit and of_platform_depopulate() doesn't do anything (Bartosz Golaszewski) - Rescan the bus from a separate thread so we don't deadlock when triggering rescan from sysfs (Bartosz Golaszewski) * tag 'pci-v6.11-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: PCI/pwrctl: Rescan bus on a separate thread PCI: Don't rely on of_platform_depopulate() for reused OF-nodes
2024-09-06Merge tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - fix potential mount hang - fix retry problem in two types of compound operations - important netfs integration fix in SMB1 read paths - fix potential uninitialized zero point of inode - minor patch to improve debugging for potential crediting problems * tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: netfs, cifs: Improve some debugging bits cifs: Fix SMB1 readv/writev callback in the same way as SMB2/3 cifs: Fix zero_point init on inode initialisation smb: client: fix double put of @cfile in smb2_set_path_size() smb: client: fix double put of @cfile in smb2_rename_path() smb: client: fix hang in wait_for_response() for negproto
2024-09-06KVM: x86: don't fall through case statements without annotationsLinus Torvalds
clang warns on this because it has an unannotated fall-through between cases: arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] and while we could annotate it as a fallthrough, the proper fix is to just add the break for this case, instead of falling through to the default case and the break there. gcc also has that warning, but it looks like gcc only warns for the cases where they fall through to "real code", rather than to just a break. Odd. Fixes: d30d9ee94cc0 ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
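The shape of the problem can be sketched in a few lines of C. This is a minimal illustration, not the code from arch/x86/kvm/x86.c: `struct toy_vm`, `enum toy_cap` and `toy_check_extension()` are hypothetical names. The point it shows is that a case which only falls into a `default:` that immediately breaks behaves the same with or without its own `break`, yet clang's -Wimplicit-fallthrough still flags the version without it, so adding the `break` silences the warning without changing behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a VM and its capability numbers. */
struct toy_vm {
	bool readonly_mem_supported;
};

enum toy_cap {
	TOY_CAP_ALWAYS_ON,
	TOY_CAP_READONLY_MEM,
	TOY_CAP_UNKNOWN,
};

/* Returns 1 if the capability is advertised for this VM, 0 otherwise. */
static int toy_check_extension(const struct toy_vm *vm, enum toy_cap cap)
{
	int r = 0;

	assert(vm != NULL);

	switch (cap) {
	case TOY_CAP_ALWAYS_ON:
		r = 1;
		break;
	case TOY_CAP_READONLY_MEM:
		r = vm->readonly_mem_supported ? 1 : 0;
		/*
		 * Explicit break. Dropping it would fall into "default" and
		 * reach the same break, but clang reports an unannotated
		 * fall-through between switch labels.
		 */
		break;
	default:
		break;
	}

	return r;
}
```

Building a variant of this function without the marked `break` using `clang -Wimplicit-fallthrough` should reproduce the class of diagnostic quoted in the commit message. Where falling through is actually intended, the kernel convention is to annotate it with the `fallthrough;` pseudo-keyword instead.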
2024-09-06Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Catalin Marinas: "Fix the arm64 usage of ftrace_graph_ret_addr() to pass the &state->graph_idx pointer instead of NULL, otherwise this function just returns early" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: stacktrace: fix the usage of ftrace_graph_ret_addr()
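Why passing NULL matters can be shown with a toy lookup. This sketch does not reproduce the kernel's ftrace_graph_ret_addr(); `toy_graph_ret_addr()`, `struct toy_ret_stack`, `struct toy_unwind_state` and `TOY_RETURN_HOOK` are hypothetical names, and the only behavior taken from the pull message is that a NULL index pointer makes the function return early. The model assumes the index is a cursor into a stack of saved return addresses that the unwinder must carry between frames.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address of the return trampoline that replaces real return addresses. */
#define TOY_RETURN_HOOK 0xdeadUL

/* Toy shadow stack holding the original return addresses. */
struct toy_ret_stack {
	unsigned long orig[4];
	int depth;
};

/* Toy unwinder state; graph_idx is the cursor carried across frames. */
struct toy_unwind_state {
	int graph_idx;
};

/*
 * Translate a trampoline address back to the original return address.
 * With a NULL cursor there is nothing to look up, so the address is
 * returned unchanged.
 */
static unsigned long toy_graph_ret_addr(const struct toy_ret_stack *rs,
					int *idx, unsigned long addr)
{
	assert(rs != NULL);

	if (addr != TOY_RETURN_HOOK)
		return addr;	/* not a hooked frame */
	if (!idx)
		return addr;	/* early return: no cursor supplied */
	if (*idx >= rs->depth)
		return addr;	/* ran out of saved entries */

	return rs->orig[(*idx)++];
}
```

In this model, calling the helper with `NULL` leaves every hooked frame reported as the trampoline address, while passing `&state.graph_idx` recovers the saved addresses in order, one per hooked frame.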