path: root/arch/x86/kvm/svm
Age | Commit message | Author
2023-08-07 | Merge tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull x86/srso fixes from Borislav Petkov: "Add a mitigation for the speculative RAS (Return Address Stack) overflow vulnerability on AMD processors. In short, this is yet another issue where userspace poisons a microarchitectural structure which can then be used to leak privileged information through a side channel"

* tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/srso: Tie SBPB bit setting to microcode patch detection
  x86/srso: Add a forgotten NOENDBR annotation
  x86/srso: Fix return thunks in generated code
  x86/srso: Add IBPB on VMEXIT
  x86/srso: Add IBPB
  x86/srso: Add SRSO_NO support
  x86/srso: Add IBPB_BRTYPE support
  x86/srso: Add a Speculative RAS Overflow mitigation
  x86/bugs: Increase the x86 bugs vector size to two u32s
2023-08-04 | KVM: SEV: remove ghcb variable declarations | Paolo Bonzini
To avoid possible time-of-check/time-of-use issues, the GHCB should almost never be accessed outside dump_ghcb, sev_es_sync_to_ghcb and sev_es_sync_from_ghcb. The only legitimate uses are to set the exitinfo fields and to find the address of the scratch area embedded in the ghcb. ghcb_usage is also accessed through svm->sev_es.ghcb in sev_es_validate_vmgexit(), but that is acceptable because the value itself is not used. Removing a shortcut variable that contains the value of svm->sev_es.ghcb makes these cases a bit more verbose, but it limits the chance of someone reading the ghcb by mistake. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-04 | KVM: SEV: only access GHCB fields once | Paolo Bonzini
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger a double fetch race condition vulnerability and invoke the VMGEXIT handler recursively. sev_handle_vmgexit() maps the GHCB page using kvm_vcpu_map() and then fetches the exit code using ghcb_get_sw_exit_code(). Soon after, sev_es_validate_vmgexit() fetches the exit code again. Since the GHCB page is shared with the guest, the guest is able to quickly swap the values with another vCPU and hence bypass the validation. One vmexit code that can be rejected by sev_es_validate_vmgexit() is SVM_EXIT_VMGEXIT; if sev_handle_vmgexit() observes it in the second fetch, the call to svm_invoke_exit_handler() will invoke sev_handle_vmgexit() again recursively. To avoid the race, always fetch the GHCB data from the places where sev_es_sync_from_ghcb stores it. Exploiting recursion in the Linux kernel has been proven feasible in the past, but the impact is mitigated by stack guard pages (CONFIG_VMAP_STACK). Still, if an attacker manages to call the handler multiple times, they can theoretically trigger a stack overflow and cause a denial-of-service, or potentially a guest-to-host escape in kernel configurations without stack guard pages. Note that winning the race reliably in every iteration is very tricky due to the very tight window of the fetches; depending on the compiler settings, they are often consecutive because of optimization and inlining. Tested by booting an SEV-ES RHEL9 guest. Fixes: CVE-2023-4155 Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Cc: stable@vger.kernel.org Reported-by: Andy Nguyen <theflow@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
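The problem is the classic double-fetch pattern on guest-shared memory. Below is a minimal, self-contained sketch of the buggy shape and the fetch-once fix; the struct layout and the validate_exit_code()/invoke_exit_handler() helpers are made-up stand-ins for illustration, not KVM's actual code.

  struct shared_ghcb {
          volatile u64 sw_exit_code;                      /* guest-writable at any time */
  };

  /* Racy: two independent reads of guest-controlled memory. */
  static int handle_vmgexit_racy(struct shared_ghcb *ghcb)
  {
          u64 code = ghcb->sw_exit_code;                  /* fetch #1 */

          if (!validate_exit_code(ghcb->sw_exit_code))    /* fetch #2 may see a different value */
                  return -EINVAL;

          return invoke_exit_handler(code);               /* validated value != used value */
  }

  /* Fixed: snapshot once into host-private memory, validate and use only the snapshot. */
  static int handle_vmgexit_safe(struct shared_ghcb *ghcb, u64 *cached_code)
  {
          *cached_code = ghcb->sw_exit_code;              /* single fetch */

          if (!validate_exit_code(*cached_code))
                  return -EINVAL;

          return invoke_exit_handler(*cached_code);
  }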
2023-08-04 | KVM: SEV: snapshot the GHCB before accessing it | Paolo Bonzini
Validation of the GHCB is susceptible to time-of-check/time-of-use vulnerabilities. To avoid them, we would like to always snapshot the fields that are read in sev_es_validate_vmgexit(), and not use the GHCB anymore after it returns. This means:
- invoking sev_es_sync_from_ghcb() before any GHCB access, including before sev_es_validate_vmgexit()
- snapshotting all fields including the valid bitmap and the sw_scratch field, which are currently not cached anywhere.
The valid bitmap is the first thing to be copied out of the GHCB; then, further accesses will use the copy in svm->sev_es. Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-03 | KVM: nSVM: Skip writes to MSR_AMD64_TSC_RATIO if guest state isn't loaded | Sean Christopherson
Skip writes to MSR_AMD64_TSC_RATIO that are done in the context of a vCPU if guest state isn't loaded, i.e. if KVM will update MSR_AMD64_TSC_RATIO during svm_prepare_switch_to_guest() before entering the guest. Checking guest_state_loaded may or may not be a net positive for performance as the current_tsc_ratio cache will optimize away duplicate WRMSRs in the vast majority of scenarios. However, the cost of the check is negligible, and the real motivation is to document that KVM needs to load the vCPU's value only when running the vCPU. Link: https://lore.kernel.org/r/20230729011608.1065019-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks | Sean Christopherson
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for propagating TSC offsets/multipliers into hardware, and instead have the vendor implementations pull the information directly from the vCPU structure. The respective vCPU fields _must_ be written at the same time in order to maintain consistent state, i.e. it's not random luck that the value passed in by all callers is grabbed from the vCPU. Explicitly grabbing the value from the vCPU field in SVM's implementation in particular will allow for additional cleanup without introducing even more subtle dependencies. Specifically, SVM can skip the WRMSR if guest state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the correct value for the vCPU prior to entering the guest. This also reconciles KVM's handling of related values that are stored in the vCPU, as svm_write_tsc_offset() already assumes/requires the caller to have updated l1_tsc_offset. Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | KVM: SVM: Clean up preemption toggling related to MSR_AMD64_TSC_RATIO | Sean Christopherson
Explicitly disable preemption when writing MSR_AMD64_TSC_RATIO only in the "outer" helper, as all direct callers of the "inner" helper now run with preemption already disabled. And that isn't a coincidence, as the outer helper requires a vCPU and is intended to be used when modifying guest state and/or emulating guest instructions, which are typically done with preemption enabled. Direct use of the inner helper should be extremely limited, as the only time KVM should modify MSR_AMD64_TSC_RATIO without a vCPU is when sanitizing the MSR for a specific pCPU (currently done when {en,dis}abling SVM). The other direct caller is svm_prepare_switch_to_guest(), which does have a vCPU, but is a one-off special case: KVM is about to enter the guest on a specific pCPU and thus must have preemption disabled. Link: https://lore.kernel.org/r/20230729011608.1065019-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
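A simplified sketch of the inner/outer split described in this and the surrounding entries (the shape follows the changelogs; the guest_state_loaded field placement and wrapper structure are illustrative): the inner helper assumes preemption is already disabled, while the outer helper takes a vCPU, disables preemption itself and skips the WRMSR when guest state isn't loaded.

  /* Inner helper: pCPU-scoped write, caller must have preemption disabled. */
  static void __svm_write_tsc_multiplier(u64 multiplier)
  {
          wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
  }

  /* Outer helper: vCPU-scoped, used when emulating or modifying guest state. */
  static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu)
  {
          preempt_disable();

          if (vcpu->arch.guest_state_loaded)              /* illustrative field name */
                  __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);

          preempt_enable();
  }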
2023-08-03 | KVM: nSVM: Use the "outer" helper for writing multiplier to MSR_AMD64_TSC_RATIO | Sean Christopherson
When emulating nested SVM transitions, use the outer helper for writing the TSC multiplier for L2. Using the inner helper only for one-off cases, i.e. for paths where KVM is NOT emulating or modifying vCPU state, will allow for multiple cleanups:
- Explicitly disabling preemption only in the outer helper
- Getting the multiplier from the vCPU field in the outer helper
- Skipping the WRMSR in the outer helper if guest state isn't loaded
Opportunistically delete an extra newline. No functional change intended. Link: https://lore.kernel.org/r/20230729011608.1065019-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | KVM: nSVM: Load L1's TSC multiplier based on L1 state, not L2 state | Sean Christopherson
When emulating nested VM-Exit, load L1's TSC multiplier if L1's desired ratio doesn't match the current ratio, not if the ratio L1 is using for L2 diverges from the default. Functionally, the end result is the same as KVM will run L2 with L1's multiplier if L2's multiplier is the default, i.e. checking that L1's multiplier is loaded is equivalent to checking if L2 has a non-default multiplier. However, the assertion that TSC scaling is exposed to L1 is flawed, as userspace can trigger the WARN at will by writing the MSR and then updating guest CPUID to hide the feature (modifying guest CPUID is allowed anytime before KVM_RUN). E.g. hacking KVM's state_test selftest to do vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0); vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR); after restoring state in a new VM+vCPU yields an endless supply of: ------------[ cut here ]------------ WARNING: CPU: 10 PID: 206939 at arch/x86/kvm/svm/nested.c:1105 nested_svm_vmexit+0x6af/0x720 [kvm_amd] Call Trace: nested_svm_exit_handled+0x102/0x1f0 [kvm_amd] svm_handle_exit+0xb9/0x180 [kvm_amd] kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm] kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm] ? trace_hardirqs_off+0x4d/0xa0 __se_sys_ioctl+0x7a/0xc0 __x64_sys_ioctl+0x21/0x30 do_syscall_64+0x41/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Unlike the nested VMRUN path, hoisting the svm->tsc_scaling_enabled check into the if-statement is wrong as KVM needs to ensure L1's multiplier is loaded in the above scenario. Alternatively, the WARN_ON() could simply be deleted, but that would make KVM's behavior even more subtle, e.g. it's not immediately obvious why it's safe to write MSR_AMD64_TSC_RATIO when checking only tsc_ratio_msr. Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230729011608.1065019-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | KVM: nSVM: Check instead of asserting on nested TSC scaling support | Sean Christopherson
Check for nested TSC scaling support on nested SVM VMRUN instead of asserting that TSC scaling is exposed to L1 if L1's MSR_AMD64_TSC_RATIO has diverged from KVM's default. Userspace can trigger the WARN at will by writing the MSR and then updating guest CPUID to hide the feature (modifying guest CPUID is allowed anytime before KVM_RUN). E.g. hacking KVM's state_test selftest to do vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0); vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR); after restoring state in a new VM+vCPU yields an endless supply of: ------------[ cut here ]------------ WARNING: CPU: 164 PID: 62565 at arch/x86/kvm/svm/nested.c:699 nested_vmcb02_prepare_control+0x3d6/0x3f0 [kvm_amd] Call Trace: <TASK> enter_svm_guest_mode+0x114/0x560 [kvm_amd] nested_svm_vmrun+0x260/0x330 [kvm_amd] vmrun_interception+0x29/0x30 [kvm_amd] svm_invoke_exit_handler+0x35/0x100 [kvm_amd] svm_handle_exit+0xe7/0x180 [kvm_amd] kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm] kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm] __se_sys_ioctl+0x7a/0xc0 __x64_sys_ioctl+0x21/0x30 do_syscall_64+0x41/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x45ca1b Note, the nested #VMEXIT path has the same flaw, but needs a different fix and will be handled separately. Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230729011608.1065019-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
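As a rough sketch of the change (the function, field, and helper names here are hypothetical stand-ins, not the real KVM code), the VMRUN path now checks the capability instead of asserting it, since userspace can legally make the MSR state and guest CPUID disagree before KVM_RUN:

  static u64 nested_vmrun_tsc_ratio(struct vcpu_state *v)
  {
          /* Before: WARN_ON(!v->tsc_scaling_exposed_to_l1) when the ratio was non-default. */
          if (v->tsc_scaling_exposed_to_l1 &&
              v->l1_tsc_ratio != TSC_RATIO_DEFAULT)
                  return combine_l1_l2_ratio(v);          /* stand-in for the real scaling math */

          return TSC_RATIO_DEFAULT;
  }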
2023-08-03 | KVM: SVM: Use "standard" stgi() helper when disabling SVM | Sean Christopherson
Now that kvm_rebooting is guaranteed to be true prior to disabling SVM in an emergency, use the existing stgi() helper instead of open coding STGI. In effect, eat faults on STGI if and only if kvm_rebooting==true. Link: https://lore.kernel.org/r/20230721201859.2307736-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | KVM: x86: Force kvm_rebooting=true during emergency reboot/crash | Sean Christopherson
Set kvm_rebooting when virtualization is disabled in an emergency so that KVM eats faults on virtualization instructions even if kvm_reboot() isn't reached. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | x86/virt: KVM: Move "disable SVM" helper into KVM SVM | Sean Christopherson
Move cpu_svm_disable() into KVM proper now that all hardware virtualization management is routed through KVM. Remove the now-empty virtext.h. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | KVM: SVM: Check that the current CPU supports SVM in kvm_is_svm_supported() | Sean Christopherson
Check "this" CPU instead of the boot CPU when querying SVM support so that the per-CPU checks done during hardware enabling actually function as intended, i.e. will detect issues where SVM isn't supported on all CPUs. Disable migration for the use from svm_init() mostly so that the standard accessors for the per-CPU data can be used without getting yelled at by CONFIG_DEBUG_PREEMPT=y sanity checks. Preventing the "disabled by BIOS" error message from reporting the wrong CPU is largely a bonus, as ensuring a stable CPU during module load is a non-goal for KVM. Link: https://lore.kernel.org/all/ZAdxNgv0M6P63odE@google.com Cc: Kai Huang <kai.huang@intel.com> Cc: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | x86/virt: KVM: Open code cpu_has_svm() into kvm_is_svm_supported() | Sean Christopherson
Fold the guts of cpu_has_svm() into kvm_is_svm_supported(), its sole remaining user. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 | x86/reboot: KVM: Disable SVM during reboot via virt/KVM reboot callback | Sean Christopherson
Use the virt callback to disable SVM (and set GIF=1) during an emergency instead of blindly attempting to disable SVM. Like the VMX case, if a hypervisor, i.e. KVM, isn't loaded/active, SVM can't be in use. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 | KVM: SVM: Use svm_get_lbr_vmcb() helper to handle writes to DEBUGCTL | Sean Christopherson
Use the recently introduced svm_get_lbr_vmcb() instead of an open coded equivalent to retrieve the target VMCB when emulating writes to MSR_IA32_DEBUGCTLMSR. No functional change intended. Link: https://lore.kernel.org/r/20230607203519.1570167-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 | KVM: SVM: Clean up handling of LBR virtualization enabled | Sean Christopherson
Clean up the enable_lbrv computation in svm_update_lbrv() to consolidate the logic for computing enable_lbrv into a single statement, and to remove the coding style violations (lack of curly braces on nested if). No functional change intended. Link: https://lore.kernel.org/r/20230607203519.1570167-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
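The cleanup amounts to computing the flag with one boolean expression instead of nested ifs; a sketch with illustrative struct, field, and helper names (only the general shape follows the changelog, not the actual svm_update_lbrv() code):

  static void svm_update_lbrv_sketch(struct svm_ctx *svm)
  {
          bool enable_lbrv = (svm->guest_debugctl & DEBUGCTL_LBR_BIT) ||
                             (svm->lbrv_supported && svm->is_nested_guest &&
                              (svm->nested_lbr_ctl & LBR_CTL_ENABLE_BIT));

          if (enable_lbrv == svm->lbrv_enabled)
                  return;                                 /* nothing to toggle */

          svm->lbrv_enabled = enable_lbrv;
          toggle_lbr_virtualization(svm, enable_lbrv);    /* stand-in helper */
  }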
2023-08-02 | KVM: SVM: Fix dead KVM_BUG() code in LBR MSR virtualization | Sean Christopherson
Refactor KVM's handling of LBR MSRs on SVM to avoid a second layer of case statements, and thus eliminate a dead KVM_BUG() call, which (a) will never be hit in the current code base and (b) if a future commit breaks things, will never fire as KVM passes "false" instead of "true" or '1' for the KVM_BUG() condition. Reported-by: Michal Luczaj <mhal@rbox.co> Cc: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230607203519.1570167-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-29 | KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid | Sean Christopherson
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid, e.g. due to setting bits 63:32, illegal combinations, or to a value that isn't allowed in VMX (non-)root mode. The VMX checks in particular are "fun" as failure to disallow Real Mode for an L2 that is configured with unrestricted guest disabled, when KVM itself has unrestricted guest enabled, will result in KVM forcing VM86 mode to virtualize Real Mode for L2, but then failing to unwind the related metadata when synthesizing a nested VM-Exit back to L1 (which has unrestricted guest enabled). Opportunistically fix a benign typo in the prototype for is_valid_cr4(). Cc: stable@vger.kernel.org Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 | Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid" | Sean Christopherson
Now that handle_fastpath_set_msr_irqoff() acquires kvm->srcu, i.e. allows dereferencing memslots during WRMSR emulation, drop the requirement that "next RIP" is valid. In hindsight, acquiring kvm->srcu would have been a better fix than avoiding the fastpath, but at the time it was thought that accessing SRCU-protected data in the fastpath was a one-off edge case. This reverts commit 5c30e8101e8d5d020b1d7119117889756a6ed713. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230721224337.2335137-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-28 | KVM: SVM: Don't try to pointlessly single-step SEV-ES guests for NMI window | Sean Christopherson
Bail early from svm_enable_nmi_window() for SEV-ES guests without trying to enable single-step of the guest, as single-stepping an SEV-ES guest is impossible and the guest is responsible for *telling* KVM when it is ready for a new NMI to be injected. Functionally, setting TF and RF in svm->vmcb->save.rflags is benign as the field is ignored by hardware, but it's all kinds of confusing. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-10-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-28 | KVM: SVM: Don't defer NMI unblocking until next exit for SEV-ES guests | Sean Christopherson
Immediately mark NMIs as unmasked in response to #VMGEXIT(NMI complete) instead of setting awaiting_iret_completion and waiting until the *next* VM-Exit to unmask NMIs. The whole point of "NMI complete" is that the guest is responsible for telling the hypervisor when it's safe to inject an NMI, i.e. there's no need to wait. And because there's no IRET to single-step, the next VM-Exit could be a long time coming, i.e. KVM could incorrectly hold an NMI pending for far longer than what is required and expected. Opportunistically fix a stale reference to HF_IRET_MASK. Fixes: 916b54a7688b ("KVM: x86: Move HF_NMI_MASK and HF_IRET_MASK into "struct vcpu_svm"") Fixes: 4444dfe4050b ("KVM: SVM: Add NMI support for an SEV-ES guest") Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-9-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-28 | KVM: SEV-ES: Eliminate #DB intercept when DebugSwap enabled | Alexey Kardashevskiy
Disable the #DB intercept for SEV-ES guests when DebugSwap is enabled. There is no point in such an intercept as KVM does not allow guest debug for SEV-ES guests. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-8-aik@amd.com [sean: add comment as to why KVM disables #DB intercept iff DebugSwap=1] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-28 | KVM: SEV: Enable data breakpoints in SEV-ES | Alexey Kardashevskiy
Add support for "DebugSwap for SEV-ES guests", which provides support for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e. allows KVM to expose debug capabilities to SEV-ES guests. Without DebugSwap support, the CPU doesn't save/load most _guest_ debug registers (except DR6/7), and KVM cannot manually context switch guest DRs due to the VMSA being encrypted. Enable DebugSwap if and only if the CPU also supports NoNestedDataBp, which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when vectoring a #DB. Without NoNestedDataBp, a malicious guest can DoS the host by putting the CPU into an infinite loop of vectoring #DBs (see https://bugzilla.redhat.com/show_bug.cgi?id=1278496) Set the feature bit in sev_es_sync_vmsa(), which is the last point at which the VMSA is not yet encrypted; sev_(es_)init_vmcb() (where most of the init happens) is called not only when the vCPU is initialised but also on intra-host migration, when the VMSA is already encrypted. Eliminate DR7 intercepts as KVM can't modify guest DR7, and intercepting DR7 would completely defeat the purpose of enabling DebugSwap. Make X86_FEATURE_DEBUG_SWAP appear in /proc/cpuinfo (by not adding "") to let the operator know if the VM can debug. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-7-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
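A condensed sketch of the gating and of setting the feature bit while the VMSA is still plaintext; the helpers, feature-enum names, and the SEV_FEAT_DEBUG_SWAP bit name below are illustrative stand-ins for the real CPUID checks and VMSA layout.

  static bool sev_es_debug_swap_usable(void)
  {
          /* Only enable DebugSwap when nested #DBs cannot be abused for a DoS. */
          return cpu_has_feature_sketch(FEAT_SEV_ES_DEBUG_SWAP) &&
                 cpu_has_feature_sketch(FEAT_NO_NESTED_DATA_BP);
  }

  static void sev_es_sync_vmsa_sketch(struct vmsa_page *vmsa)
  {
          /*
           * Must happen here: after LAUNCH_UPDATE_VMSA the page is encrypted
           * and KVM can no longer modify it, e.g. on intra-host migration.
           */
          if (sev_es_debug_swap_usable())
                  vmsa->sev_features |= SEV_FEAT_DEBUG_SWAP;
  }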
2023-07-28 | KVM: SVM/SEV/SEV-ES: Rework intercepts | Alexey Kardashevskiy
Currently SVM setup is done sequentially in init_vmcb() -> sev_init_vmcb() -> sev_es_init_vmcb() and tries keeping SVM/SEV/SEV-ES bits separated. One of the exceptions is the DR intercepts, which are set up for SEV-ES before sev_es_init_vmcb() runs. Move the SEV-ES intercept setup to sev_es_init_vmcb(). From now on set_dr_intercepts()/clr_dr_intercepts() handle SVM/SEV only. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Santosh Shukla <santosh.shukla@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-6-aik@amd.com [sean: drop comment about intercepting DR7] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-28 | KVM: SEV-ES: explicitly disable debug | Alexey Kardashevskiy
SVM/SEV enable debug register intercepts to skip swapping DRs on entering/exiting the guest. When the guest is in control of debug registers (vcpu->guest_debug == 0), there is an optimisation to reduce the number of context switches: intercepts are cleared and the KVM_DEBUGREG_WONT_EXIT flag is set to tell KVM to do swapping on guest enter/exit. The same code also executes for SEV-ES, however it has no effect as:
- it always takes the (vcpu->guest_debug == 0) branch;
- KVM_DEBUGREG_WONT_EXIT is set but the DR7 intercept is not cleared;
- vcpu_enter_guest() writes DRs but VMRUN for SEV-ES swaps them with the values from the _encrypted_ VMSA.
Be explicit about SEV-ES not supporting debug:
- return right away from dr_interception() and skip unnecessary processing;
- return an error right away from the KVM_SEV_LAUNCH_UPDATE_VMSA handler if debugging was already enabled.
KVM_SET_GUEST_DEBUG already fails after KVM_SEV_LAUNCH_UPDATE_VMSA is finished due to vcpu->arch.guest_state_protected being set to true. Add a WARN_ON to kvm_x86::sync_dirty_debug_regs() (which saves guest DRs on guest exit) to signify that SEV-ES won't hit that path. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-5-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
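The "be explicit" part boils down to early exits keyed off the guest-state-protected condition; a simplified sketch with hypothetical helper names (the guest_state_protected and guest_debug fields mirror the changelog, the rest is illustrative):

  static int dr_interception_sketch(struct kvm_vcpu *vcpu)
  {
          if (vcpu->arch.guest_state_protected)   /* SEV-ES: DRs live in the encrypted VMSA */
                  return 1;                       /* nothing to emulate, just resume the guest */

          return emulate_dr_access(vcpu);         /* normal DR read/write emulation */
  }

  static int sev_launch_update_vmsa_sketch(struct kvm_vcpu *vcpu)
  {
          if (vcpu->guest_debug)                  /* debugging already enabled: refuse */
                  return -EINVAL;

          return encrypt_vmsa(vcpu);              /* stand-in for the real launch-update flow */
  }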
2023-07-28 | KVM: SVM: Rewrite sev_es_prepare_switch_to_guest()'s comment about swap types | Sean Christopherson
Rewrite the comment(s) in sev_es_prepare_switch_to_guest() to explain the swap types employed by the CPU for SEV-ES guests, i.e. to explain why KVM needs to save a seemingly random subset of host state, and to provide a decoder for the APM's Type-A/B/C terminology. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-4-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-28 | KVM: SEV: Move SEV's GP_VECTOR intercept setup to SEV | Alexey Kardashevskiy
Currently SVM setup is done sequentially in init_vmcb() -> sev_init_vmcb() -> sev_es_init_vmcb() and tries keeping SVM/SEV/SEV-ES bits separated. One of the exceptions is #GP intercept which init_vmcb() skips setting for SEV guests and then sev_es_init_vmcb() needlessly clears it. Remove the SEV check from init_vmcb(). Clear the #GP intercept in sev_init_vmcb(). SEV-ES will use the SEV setting. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Santosh Shukla <santosh.shukla@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-3-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-28 | KVM: SEV: move set_dr_intercepts/clr_dr_intercepts from the header | Alexey Kardashevskiy
Static functions set_dr_intercepts() and clr_dr_intercepts() are only called from SVM so move them to .c. No functional change intended. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Santosh Shukla <santosh.shukla@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-2-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-27 | x86/srso: Add IBPB on VMEXIT | Borislav Petkov (AMD)
Add the option to issue an IBPB only on VMEXIT, in order to protect against malicious guests in setups where the rest of the software running on the host is otherwise trusted. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-01 | Merge tag 'kvm-x86-svm-6.5' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini
KVM SVM changes for 6.5:
- Drop manual TR/TSS load after VM-Exit now that KVM uses VMLOAD for host state
- Fix a not-yet-problematic missing call to trace_kvm_exit() for VM-Exits that are handled in the fastpath
- Print more descriptive information about the status of SEV and SEV-ES during module load
- Assert that misc_cg_set_capacity() doesn't fail to avoid should-be-impossible memory leaks
2023-07-01 | Merge tag 'kvm-x86-pmu-6.5' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini
KVM x86/pmu changes for 6.5: - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way
2023-07-01 | Merge tag 'kvm-x86-misc-6.5' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini
KVM x86 changes for 6.5:
* Move handling of PAT out of MTRR code and dedup SVM+VMX code
* Fix output of PIC poll command emulation when there's an interrupt
* Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc.
* Misc cleanups
2023-06-13 | KVM: SVM: WARN, but continue, if misc_cg_set_capacity() fails | Sean Christopherson
WARN and continue if misc_cg_set_capacity() fails, as the only scenario in which it can fail is if the specified resource is invalid, which should never happen when CONFIG_KVM_AMD_SEV=y. Deliberately not bailing "fixes" a theoretical bug where KVM would leak the ASID bitmaps on failure, which again can't happen. If the impossible should happen, the end result is effectively the same with respect to SEV and SEV-ES (they are unusable), while continuing on has the advantage of letting KVM load, i.e. userspace can still run non-SEV guests. Reported-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20230607004449.1421131-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
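The change is essentially "warn instead of bail"; a sketch of the idea (misc_cg_set_capacity() and MISC_CG_RES_SEV are the real misc-cgroup helper and resource type, while the surrounding setup function is simplified and illustrative):

  static void sev_set_cgroup_capacity_sketch(unsigned int nr_sev_asids)
  {
          /*
           * Failure is only possible with an invalid resource type, which
           * can't happen with CONFIG_KVM_AMD_SEV=y; don't fail module load
           * (and leak the ASID bitmaps) over a theoretical error.
           */
          WARN_ON_ONCE(misc_cg_set_capacity(MISC_CG_RES_SEV, nr_sev_asids));
  }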
2023-06-06 | KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022 | Like Xu
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance monitoring features for AMD processors. Bit 0 of EAX indicates support for Performance Monitoring Version 2 (PerfMonV2) features. If found to be set during PMU initialization, the EBX bits of the same CPUID function can be used to determine the number of available PMCs for different PMU types. Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that guests can make use of the PerfMonV2 features. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230603011058.1038821-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 | KVM: x86/svm/pmu: Add AMD PerfMonV2 support | Like Xu
If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by the guest, it can use a new scheme to manage the Core PMCs using the new global control and status registers. In addition to benefiting from the PerfMonV2 functionality in the same way as the host (higher precision), the guest also can reduce the number of vm-exits by lowering the total number of MSRs accesses. In terms of implementation details, amd_is_valid_msr() is resurrected since three newly added MSRs could not be mapped to one vPMC. The possibility of emulating PerfMonV2 on the mainframe has also been eliminated for reasons of precision. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: drop "Based on the observed HW." comments] Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 | KVM: x86/pmu: Constrain the num of guest counters with kvm_pmu_cap | Like Xu
Cap the number of general purpose counters enumerated on AMD to what KVM actually supports, i.e. don't allow userspace to coerce KVM into thinking there are more counters than actually exist, e.g. by enumerating X86_FEATURE_PERFCTR_CORE in guest CPUID when it's not supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 | KVM: x86/pmu: Advertise PERFCTR_CORE iff the min nr of counters is met | Like Xu
Enable and advertise PERFCTR_CORE if and only if the minimum number of required counters are available, i.e. if perf says there are at least six general purpose counters. Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding the check for host support. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage shortlog and changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 | KVM: x86/pmu: Disable vPMU if the minimum num of counters isn't met | Like Xu
Disable PMU support when running on AMD and perf reports fewer than four general purpose counters. All AMD PMUs must define at least four counters due to AMD's legacy architecture hardcoding the number of counters without providing a way to enumerate the number of counters to software, e.g. from AMD's APM: The legacy architecture defines four performance counters (PerfCtrn) and corresponding event-select registers (PerfEvtSeln). Virtualizing fewer than four counters can lead to guest instability as software expects four counters to be available. Rather than bleed AMD details into the common code, just define a const unsigned int and provide a convenient location to document why Intel and AMD have different mins (in particular, AMD's lack of any way to enumerate less than four counters to the guest). Keep the minimum number of counters at Intel at one, even though old P6 and Core Solo/Duo processors effectively require a minimum of two counters. KVM can, and more importantly has up until this point, supported a vPMU so long as the CPU has at least one counter. Perf's support for P6/Core CPUs does require two counters, but perf will happily chug along with a single counter when running on a modern CPU. Cc: Jim Mattson <jmattson@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: set Intel min to '1', not '2'] Link: https://lore.kernel.org/r/20230603011058.1038821-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
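A sketch of the "document the minimum with a const" approach; the values follow the changelog (four for AMD's legacy PMU, one for Intel), but the identifier names and the helper are illustrative, not the actual KVM code:

  static const unsigned int KVM_AMD_PMC_MIN_GP   = 4;    /* legacy AMD PMU hardcodes four counters */
  static const unsigned int KVM_INTEL_PMC_MIN_GP = 1;    /* Intel vPMU can work with a single counter */

  static bool vpmu_is_supported(unsigned int host_gp_counters, bool is_amd)
  {
          unsigned int min = is_amd ? KVM_AMD_PMC_MIN_GP : KVM_INTEL_PMC_MIN_GP;

          return host_gp_counters >= min;                 /* otherwise hide the vPMU from guests */
  }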
2023-06-06 | KVM: x86/pmu: Provide Intel PMU's pmc_is_enabled() as generic x86 code | Like Xu
Move the Intel PMU implementation of pmc_is_enabled() to common x86 code as pmc_is_globally_enabled(), and drop AMD's implementation. AMD PMU currently supports only v1, and thus not PERF_GLOBAL_CONTROL, so the semantics for AMD are unchanged. And when support for AMD PMU v2 comes along, the common behavior will also Just Work. Signed-off-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230603011058.1038821-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 | KVM: x86: Clean up: remove redundant bool conversions | Michal Luczaj
As test_bit() returns bool, explicitly converting result to bool is unnecessary. Get rid of '!!'. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230605200158.118109-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-06 | KVM: SVM: enhance info printk's in SEV init | Alexander Mikhalitsyn
Let's print the available ASID ranges for SEV/SEV-ES guests. This information can be useful to a system administrator when debugging why SEV/SEV-ES fails to enable. There are a few reasons.
SEV:
- NPT is disabled (module parameter)
- CPU lacks some features (sev, decodeassists)
- Maximum SEV ASID is 0
SEV-ES:
- mmio_caching is disabled (module parameter)
- CPU lacks the sev_es feature
- Minimum SEV ASID value is 1 (can be adjusted in BIOS/UEFI)
Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stéphane Graber <stgraber@ubuntu.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20230522161249.800829-3-aleksandr.mikhalitsyn@canonical.com [sean: print '0' for min SEV-ES ASID if there are no available ASIDs] Signed-off-by: Sean Christopherson <seanjc@google.com>
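As a sketch of what such status lines could look like (the format string and variable names are illustrative; the assumption here is that SEV ASIDs occupy [min_sev_asid, max_sev_asid] and SEV-ES ASIDs occupy [1, min_sev_asid - 1], with '0' printed when no SEV-ES ASIDs are available, per the note above):

  static void sev_print_status_sketch(bool sev, bool sev_es,
                                      unsigned int min_sev_asid,
                                      unsigned int max_sev_asid)
  {
          pr_info("SEV %s (ASIDs %u - %u)\n",
                  sev ? "enabled" : "disabled", min_sev_asid, max_sev_asid);
          pr_info("SEV-ES %s (ASIDs %u - %u)\n",
                  sev_es ? "enabled" : "disabled",
                  min_sev_asid > 1 ? 1 : 0,
                  min_sev_asid > 1 ? min_sev_asid - 1 : 0);
  }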
2023-06-02 | KVM: SVM: Invoke trace_kvm_exit() for fastpath VM-Exits | Sean Christopherson
Move SVM's call to trace_kvm_exit() from the "slow" VM-Exit handler to svm_vcpu_run() so that KVM traces fastpath VM-Exits that re-enter the guest without bouncing through the slow path. This bug is benign in the current code base as KVM doesn't currently support any such exits on SVM. Fixes: a9ab13ff6e84 ("KVM: X86: Improve latency for single target IPI fastpath") Link: https://lore.kernel.org/r/20230602011920.787844-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02 | KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK | Maciej S. Szmigiero
While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware I noticed that with a vCPU count large enough (> 16) they sometimes froze at boot. With a vCPU count of 64 they never booted successfully - suggesting some kind of a race condition. Since adding the "vnmi=0" module parameter made these guests boot successfully it was clear that the problem is most likely (v)NMI-related. Running kvm-unit-tests quickly showed failing NMI-related test cases, like "multiple nmi" and "pending nmi" from the apic-split, x2apic and xapic tests, and the NMI parts of the eventinj test. The issue was that once one NMI was being serviced no other NMI was allowed to be set pending (NMI limit = 0), which was traced to svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather than for the "NMI pending" flag. Fix this by testing for the right flag in svm_is_vnmi_pending(). Once this is done, the NMI-related kvm-unit-tests pass successfully and the Windows guest no longer freezes at boot. Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 | KVM: x86: Move common handling of PAT MSR writes to kvm_set_msr_common() | Sean Christopherson
Move the common check-and-set handling of PAT MSR writes out of vendor code and into kvm_set_msr_common(). This aligns writes with reads, which are already handled in common code, i.e. makes the handling of reads and writes symmetrical in common code. Alternatively, the common handling in kvm_get_msr_common() could be moved to vendor code, but duplicating code is generally undesirable (even though the duplicated code is trivial in this case), and guest writes to PAT should be rare, i.e. the overhead of the extra function call is a non-issue in practice. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01 | KVM: SVM: Use kvm_pat_valid() directly instead of kvm_mtrr_valid() | Ke Guo
Use kvm_pat_valid() directly instead of bouncing through kvm_mtrr_valid(). The PAT is not an MTRR, and kvm_mtrr_valid() just redirects to kvm_pat_valid(), i.e. is exempt from KVM's "zap SPTEs" logic that's needed to honor guest MTRRs when the VM has a passthrough device with non-coherent DMA (KVM does NOT set "ignore guest PAT" in this case, and so enables hardware virtualization of the guest's PAT, i.e. doesn't need to manually emulate the PAT memtype). Signed-off-by: Ke Guo <guoke@uniontech.com> [sean: massage changelog] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
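For reference, a PAT value is valid when each of its eight byte-sized entries encodes a defined memory type; below is a standalone sketch in the spirit of kvm_pat_valid() (not necessarily the kernel's exact implementation, which can use a bit-twiddling shortcut instead of a loop):

  static bool pat_value_valid(u64 pat)
  {
          unsigned int i;

          for (i = 0; i < 8; i++) {
                  u8 entry = (pat >> (8 * i)) & 0xff;

                  if (entry & 0xf8)               /* upper five bits of each entry are reserved */
                          return false;
                  if (entry == 2 || entry == 3)   /* reserved encodings; 0,1,4,5,6,7 = UC, WC, WT, WP, WB, UC- */
                          return false;
          }

          return true;
  }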
2023-06-01 | KVM: SVM: Remove TSS reloading code after VMEXIT | Mingwei Zhang
Remove the dedicated post-VMEXIT TSS reloading code now that KVM uses VMLOAD to load host segment state, which includes TSS state. Fixes: e79b91bb3c91 ("KVM: SVM: use vmsave/vmload for saving/restoring additional host state") Reported-by: Venkatesh Srinivas <venkateshs@google.com> Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230523165635.4002711-1-mizhang@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-01 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest
- AMD SVM:
  - Add support for virtual NMIs
  - Fixes for edge cases related to virtual interrupts
- Intel AMX:
  - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole
Documentation:
- Fix goof introduced by the conversion to rST"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
  KVM: s390: pci: fix virtual-physical confusion on module unload/load
  KVM: s390: vsie: clarifications on setting the APCB
  KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: selftests: Test the PMU event "Instructions retired"
  KVM: selftests: Copy full counter values from guest in PMU event filter test
  KVM: selftests: Use error codes to signal errors in PMU event filter test
  KVM: selftests: Print detailed info in PMU event filter asserts
  KVM: selftests: Add helpers for PMC asserts in PMU event filter test
  KVM: selftests: Add a common helper for the PMU event filter guest code
  KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: selftests: Add test to verify KVM's supported XCR0
  ...
2023-04-28 | Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull SMP cross-CPU function-call updates from Ingo Molnar:
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy way to instrument IPI origins: it's arch dependent and for some major architectures it's not even consistently available.

* tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  trace,smp: Trace all smp_function_call*() invocations
  trace: Add trace_ipi_send_cpu()
  sched, smp: Trace smp callback causing an IPI
  smp: reword smp call IPI comment
  treewide: Trace IPIs sent via smp_send_reschedule()
  irq_work: Trace self-IPIs sent via arch_irq_work_raise()
  smp: Trace IPIs sent via arch_send_call_function_ipi_mask()
  sched, smp: Trace IPIs sent via send_call_function_single_ipi()
  trace: Add trace_ipi_send_cpumask()
  kernel/smp: Make csdlock_debug= resettable
  locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging
  locking/csd_lock: Remove added data from CSD lock debugging
  locking/csd_lock: Add Kconfig option for csd_debug default