summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx
AgeCommit message (Collapse)Author
2022-09-26KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-likeSean Christopherson
Exclude General Detect #DBs, which have fault-like behavior but also have a non-zero payload (DR6.BD=1), from nVMX's handling of pending debug traps. Opportunistically rewrite the comment to better document what is being checked, i.e. "has a non-zero payload" vs. "has a payload", and to call out the many caveats surrounding #DBs that KVM dodges one way or another. Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Fixes: 684c0422da71 ("KVM: nVMX: Handle pending #DB when injecting INIT VM-exit") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-7-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCSSean Christopherson
Deliberately truncate the exception error code when shoving it into the VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12). Intel CPUs are incapable of handling 32-bit error codes and will never generate an error code with bits 31:16, but userspace can provide an arbitrary error code via KVM_SET_VCPU_EVENTS. Failure to drop the bits on exception injection results in failed VM-Entry, as VMX disallows setting bits 31:16. Setting the bits on VM-Exit would at best confuse L1, and at worse induce a nested VM-Entry failure, e.g. if L1 decided to reinject the exception back into L2. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-3-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"Sean Christopherson
Drop pending exceptions and events queued for re-injection when leaving nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced by host userspace. Failure to purge events could result in an event belonging to L2 being injected into L1. This _should_ never happen for VM-Fail as all events should be blocked by nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is the source of VM-Fail when running vmcs02. SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry to SMM is blocked by pending exceptions and re-injected events. Forced exit is definitely buggy, but has likely gone unnoticed because userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or some other ioctl() that purges the queue). Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-2-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Use cached host MSR_IA32_VMX_MISC value for setting up nested MSRVitaly Kuznetsov
vmcs_config has cached host MSR_IA32_VMX_MISC value, use it for setting up nested MSR_IA32_VMX_MISC in nested_vmx_setup_ctls_msrs() and avoid the redundant rdmsr(). No (real) functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-34-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Cache MSR_IA32_VMX_MISC in vmcs_configVitaly Kuznetsov
Like other host VMX control MSRs, MSR_IA32_VMX_MISC can be cached in vmcs_config to avoid the need to re-read it later, e.g. from cpu_has_vmx_intel_pt() or cpu_has_vmx_shadow_vmcs(). No (real) functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-33-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Use sanitized allowed-1 bits for VMX control MSRsVitaly Kuznetsov
Using raw host MSR values for setting up nested VMX control MSRs is incorrect as some features need to disabled, e.g. when KVM runs as a nested hypervisor on Hyper-V and uses Enlightened VMCS or when a workaround for IA32_PERF_GLOBAL_CTRL is applied. For non-nested VMX, this is done in setup_vmcs_config() and the result is stored in vmcs_config. Use it for setting up allowed-1 bits in nested VMX MSRs too. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-32-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Always set required-1 bits of pinbased_ctls to ↵Vitaly Kuznetsov
PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR Similar to exit_ctls_low, entry_ctls_low, and procbased_ctls_low, pinbased_ctls_low should be set to PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR and not host's MSR_IA32_VMX_PINBASED_CTLS value |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR. The commit eabeaaccfca0 ("KVM: nVMX: Clean up and fix pin-based execution controls") which introduced '|=' doesn't mention anything about why this is needed, the change seems rather accidental. Note: normally, required-1 portion of MSR_IA32_VMX_PINBASED_CTLS should be equal to PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR so no behavioral change is expected, however, it is (in theory) possible to observe something different there when e.g. KVM is running as a nested hypervisor. Hope this doesn't happen in practice. Reported-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-31-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of ↵Vitaly Kuznetsov
setup_vmcs_config() As a preparation to reusing the result of setup_vmcs_config() for setting up nested VMX control MSRs, move LOAD_IA32_PERF_GLOBAL_CTRL errata handling to vmx_vmexit_ctrl()/vmx_vmentry_ctrl() and print the warning from hardware_setup(). While it seems reasonable to not expose LOAD_IA32_PERF_GLOBAL_CTRL controls to L1 hypervisor on buggy CPUs, such change would inevitably break live migration from older KVMs where the controls are exposed. Keep the status quo for now, L1 hypervisor itself is supposed to take care of the errata. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-30-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: VMX: Replace some Intel model numbers with mnemonicsJim Mattson
Intel processor code names are more familiar to many readers than their decimal model numbers. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-29-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Adjust CR3/INVPLG interception for EPT=y at runtime, not setupSean Christopherson
Clear the CR3 and INVLPG interception controls at runtime based on whether or not EPT is being _used_, as opposed to clearing the bits at setup if EPT is _supported_ in hardware, and then restoring them when EPT is not used. Not mucking with the base config will allow using the base config as the starting point for emulating the VMX capability MSRs. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-28-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Add missing CPU based VM execution controls to vmcs_configVitaly Kuznetsov
As a preparation to reusing the result of setup_vmcs_config() in nested VMX MSR setup, add the CPU based VM execution controls which KVM doesn't use but supports for nVMX to KVM_OPT_VMX_CPU_BASED_VM_EXEC_CONTROL and filter them out in vmx_exec_control(). No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-27-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Add missing VMEXIT controls to vmcs_configVitaly Kuznetsov
As a preparation to reusing the result of setup_vmcs_config() in nested VMX MSR setup, add the VMEXIT controls which KVM doesn't use but supports for nVMX to KVM_OPT_VMX_VM_EXIT_CONTROLS and filter them out in vmx_vmexit_ctrl(). No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-26-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Move CPU_BASED_CR8_{LOAD,STORE}_EXITING filtering out of ↵Vitaly Kuznetsov
setup_vmcs_config() As a preparation to reusing the result of setup_vmcs_config() in nested VMX MSR setup, move CPU_BASED_CR8_{LOAD,STORE}_EXITING filtering to vmx_exec_control(). No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-25-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Extend VMX controls macro shenanigansVitaly Kuznetsov
When VMX controls macros are used to set or clear a control bit, make sure that this bit was checked in setup_vmcs_config() and thus is properly reflected in vmcs_config. Opportunistically drop pointless "< 0" check for adjust_vmx_controls()'s return value. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-24-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Don't toggle VM_ENTRY_IA32E_MODE for 32-bit kernels/KVMSean Christopherson
Don't toggle VM_ENTRY_IA32E_MODE in 32-bit kernels/KVM and instead bug the VM if KVM attempts to run the guest with EFER.LMA=1. KVM doesn't support running 64-bit guests with 32-bit hosts. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-23-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Tweak the special handling of SECONDARY_EXEC_ENCLS_EXITING in ↵Vitaly Kuznetsov
setup_vmcs_config() SECONDARY_EXEC_ENCLS_EXITING is the only control which is conditionally added to the 'optional' checklist in setup_vmcs_config() but the special case can be avoided by always checking for its presence first and filtering out the result later. Note: the situation when SECONDARY_EXEC_ENCLS_EXITING is present but cpu_has_sgx() is false is possible when SGX is "soft-disabled", e.g. if software writes MCE control MSRs or there's an uncorrectable #MC. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-22-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Check CPU_BASED_{INTR,NMI}_WINDOW_EXITING in setup_vmcs_config()Vitaly Kuznetsov
CPU_BASED_{INTR,NMI}_WINDOW_EXITING controls are toggled dynamically by vmx_enable_{irq,nmi}_window, handle_interrupt_window(), handle_nmi_window() but setup_vmcs_config() doesn't check their existence. Add the check and filter the controls out in vmx_exec_control(). Note: KVM explicitly supports CPUs without VIRTUAL_NMIS and all these CPUs are supposedly lacking NMI_WINDOW_EXITING too. Adjust cpu_has_virtual_nmis() accordingly. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-21-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Check VM_ENTRY_IA32E_MODE in setup_vmcs_config()Vitaly Kuznetsov
VM_ENTRY_IA32E_MODE control is toggled dynamically by vmx_set_efer() and setup_vmcs_config() doesn't check its existence. On the contrary, nested_vmx_setup_ctls_msrs() doesn set it on x86_64. Add the missing check and filter the bit out in vmx_vmentry_ctrl(). No (real) functional change intended as all existing CPUs supporting long mode and VMX are supposed to have it. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-20-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Always emulate PERF_GLOBAL_CTRL VM-Entry/VM-Exit controlsSean Christopherson
Advertise VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL as being supported for nested VMs irrespective of hardware support. KVM fully emulates the controls, i.e. manually emulates MSR writes on entry/exit, and never propagates the guest settings directly to vmcs02. In addition to allowing L1 VMMs to use the controls on older hardware, unconditionally advertising the controls will also allow KVM to use its vmcs01 configuration as the basis for the nested VMX configuration without causing a regression (due the errata which causes KVM to "hide" the control from vmcs01 but not vmcs12). Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-19-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Don't propagate vmcs12's PERF_GLOBAL_CTRL settings to vmcs02Sean Christopherson
Don't propagate vmcs12's VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL to vmcs02. KVM doesn't disallow L1 from using VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL even when KVM itself doesn't use the control, e.g. due to the various CPU errata that where the MSR can be corrupted on VM-Exit. Preserve KVM's (vmcs01) setting to hopefully avoid having to toggle the bit in vmcs02 at a later point. E.g. if KVM is loading PERF_GLOBAL_CTRL when running L1, then odds are good KVM will also load the MSR when running L2. Fixes: 8bf00a529967 ("KVM: VMX: add support for switching of PERF_GLOBAL_CTRL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20220830133737.1539624-18-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Get rid of eVMCS specific VMX controls sanitizationVitaly Kuznetsov
With the updated eVMCSv1 definition, there's no known 'problematic' controls which are exposed in VMX control MSRs but are not present in eVMCSv1: all known Hyper-V versions either don't expose the new fields by not setting bits in the VMX feature controls or support the new eVMCS revision. Get rid of VMX control MSRs filtering for KVM on Hyper-V. Note: VMX control MSRs filtering for Hyper-V on KVM (nested_evmcs_filter_control_msr()) stays as even the updated eVMCSv1 definition doesn't have all the features implemented by KVM and some fields are still missing. Moreover, nested_evmcs_filter_control_msr() has to support the original eVMCSv1 version when VMM wishes so. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-17-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Support PERF_GLOBAL_CTRL with enlightened VMCSVitaly Kuznetsov
Enlightened VMCS v1 got updated and now includes the required fields for loading PERF_GLOBAL_CTRL upon VMENTER/VMEXIT features. For KVM on Hyper-V enablement, KVM can just observe VMX control MSRs and use the features (with or without eVMCS) when possible. Hyper-V on KVM is messier as Windows 11 guests fail to boot if the controls are advertised and a new PV feature flag, CPUID.0x4000000A.EBX BIT(0), is not set. Honor the Hyper-V CPUID feature flag to play nice with Windows guests. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20220830133737.1539624-16-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: WARN once and fail VM-Enter if eVMCS sees VMFUNC[63:32] != 0Sean Christopherson
WARN and reject nested VM-Enter if KVM is using eVMCS and manages to allow a non-zero value in the upper 32 bits of VM-function controls. The eVMCS code assumes all inputs are 32-bit values and subtly drops the upper bits. WARN instead of adding proper "support", it's unlikely the upper bits will be defined/used in the next decade. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-15-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Support several new fields in eVMCSv1Vitaly Kuznetsov
Enlightened VMCS v1 definition was updated with new fields, add support for them for Hyper-V on KVM. Note: SSP, CET and Guest LBR features are not supported by KVM yet and 'struct vmcs12' has no corresponding fields. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-11-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Define VMCS-to-EVMCS conversion for the new fieldsVitaly Kuznetsov
Enlightened VMCS v1 definition was updated with new fields, support them in KVM by defining VMCS-to-EVMCS conversion. Note: SSP, CET and Guest LBR features are not supported by KVM yet and the corresponding fields are not defined in 'enum vmcs_field', leave them commented out for now. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-10-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Use CC() macro to handle eVMCS unsupported controls checksSean Christopherson
Locally #define and use the nested virtualization Consistency Check (CC) macro to handle eVMCS unsupported controls checks. Using the macro loses the existing printing of the unsupported controls, but that's a feature and not a bug. The existing approach is flawed because the @err param to trace_kvm_nested_vmenter_failed() is the error code, not the error value. The eVMCS trickery mostly works as __print_symbolic() falls back to printing the raw hex value, but that subtly relies on not having a match between the unsupported value and VMX_VMENTER_INSTRUCTION_ERRORS. If it's really truly necessary to snapshot the bad value, then the tracepoint can be extended in the future. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-9-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Refactor unsupported eVMCS controls logic to use 2-d arrayVitaly Kuznetsov
Refactor the handling of unsupported eVMCS to use a 2-d array to store the set of unsupported controls. KVM's handling of eVMCS is completely broken as there is no way for userspace to query which features are unsupported, nor does KVM prevent userspace from attempting to enable unsupported features. A future commit will remedy that by filtering and enforcing unsupported features when eVMCS, but that needs to be opt-in from userspace to avoid breakage, i.e. KVM needs to maintain its legacy behavior by snapshotting the exact set of controls that are currently (un)supported by eVMCS. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> [sean: split to standalone patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-8-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Treat eVMCS as enabled for guest iff Hyper-V is also enabledSean Christopherson
When querying whether or not eVMCS is enabled on behalf of the guest, treat eVMCS as enable if and only if Hyper-V is enabled/exposed to the guest. Note, flows that come from the host, e.g. KVM_SET_NESTED_STATE, must NOT check for Hyper-V being enabled as KVM doesn't require guest CPUID to be set before most ioctls(). Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-7-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Do not declare vmread_error() asmlinkageUros Bizjak
There is no need to declare vmread_error() asmlinkage, its arguments can be passed via registers for both 32-bit and 64-bit targets. Function argument registers are considered call-clobbered registers, they are saved in the trampoline just before the function call and restored afterwards. Dropping "asmlinkage" patch unifies trampoline function argument handling between 32-bit and 64-bit targets and improves generated code for 32-bit targets. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20220817144045.3206-1-ubizjak@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Print guest pgd in kvm_nested_vmenter()Mingwei Zhang
Print guest pgd in kvm_nested_vmenter() to enrich the information for tracing. When tdp is enabled, print the value of tdp page table (EPT/NPT); when tdp is disabled, print the value of non-nested CR3. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20220825225755.907001-4-mizhang@google.com [sean: print nested_cr3 vs. nested_eptp vs. guest_cr3] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Add tracepoint for nested VM-EnterDavid Matlack
Call trace_kvm_nested_vmenter() during nested VMLAUNCH/VMRESUME to bring parity with nSVM's usage of the tracepoint during nested VMRUN. Attempt to use analagous VMCS fields to the VMCB fields that are reported in the SVM case: "int_ctl": 32-bit field of the VMCB that the CPU uses to deliver virtual interrupts. The analagous VMCS field is the 16-bit "guest interrupt status". "event_inj": 32-bit field of VMCB that is used to inject events (exceptions and interrupts) into the guest. The analagous VMCS field is the "VM-entry interruption-information field". "npt_enabled": 1 when the VCPU has enabled nested paging. The analagous VMCS field is the enable-EPT execution control. "npt_addr": 64-bit field when the VCPU has enabled nested paging. The analagous VMCS field is the ept_pointer. Signed-off-by: David Matlack <dmatlack@google.com> [move the code into the nested_vmx_enter_non_root_mode().] Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20220825225755.907001-3-mizhang@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: Add extra information in kvm_page_fault trace pointWonhyuk Yang
Currently, kvm_page_fault trace point provide fault_address and error code. However it is not enough to find which cpu and instruction cause kvm_page_faults. So add vcpu id and instruction pointer in kvm_page_fault trace point. Cc: Baik Song An <bsahn@etri.re.kr> Cc: Hong Yeon Kim <kimhy@etri.re.kr> Cc: Taeung Song <taeung@reallinux.co.kr> Cc: linuxgeek@linuxgeek.io Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Link: https://lore.kernel.org/r/20220510071001.87169-1-vvghjk1234@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM/VMX: Avoid stack engine synchronization uop in __vmx_vcpu_runUros Bizjak
Avoid instructions with explicit uses of the stack pointer between instructions that implicitly refer to it. The sequence of POP %reg; ADD $x, %RSP; POP %reg forces emission of synchronization uop to synchronize the value of the stack pointer in the stack engine and the out-of-order core. Using POP with the dummy register instead of ADD $x, %RSP results in a smaller code size and faster code. The patch also fixes the reference to the wrong register in the nearby comment. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20220816211010.25693-1-ubizjak@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19KVM: VMX: Heed the 'msr' argument in msr_write_intercepted()Jim Mattson
Regardless of the 'msr' argument passed to the VMX version of msr_write_intercepted(), the function always checks to see if a specific MSR (IA32_SPEC_CTRL) is intercepted for write. This behavior seems unintentional and unexpected. Modify the function so that it checks to see if the provided 'msr' index is intercepted for write. Fixes: 67f4b9969c30 ("KVM: nVMX: Handle dynamic MSR intercept toggling") Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220810213050.2655000-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-15x86: Fix various duplicate-word comment typosJason Wang
[ mingo: Consolidated 4 very similar patches into one, it's silly to spread this out. ] Signed-off-by: Jason Wang <wangborong@cdjrlc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220715044809.20572-1-wangborong@cdjrlc.com
2022-08-11Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull more kvm updates from Paolo Bonzini: - Xen timer fixes - Documentation formatting fixes - Make rseq selftest compatible with glibc-2.35 - Fix handling of illegal LEA reg, reg - Cleanup creation of debugfs entries - Fix steal time cache handling bug - Fixes for MMIO caching - Optimize computation of number of LBRs - Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits) KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES KVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test KVM: selftests: Make rseq compatible with glibc-2.35 KVM: Actually create debugfs in kvm_create_vm() KVM: Pass the name of the VM fd to kvm_create_vm_debugfs() KVM: Get an fd before creating the VM KVM: Shove vcpu stats_id init into kvm_vcpu_init() KVM: Shove vm stats_id init into kvm_create_vm() KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen KVM: x86/mmu: rename trace function name for asynchronous page fault KVM: x86/xen: Stop Xen timer before changing IRQ KVM: x86/xen: Initialize Xen timer only once KVM: SVM: Disable SEV-ES support if MMIO caching is disable KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change KVM: x86: Tag kvm_mmu_x86_module_init() with __init ...
2022-08-10KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refreshSean Christopherson
Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written by host userspace, zero out the number of LBR records for a vCPU during PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of handling the check at run-time. guest_cpuid_has() is expensive due to the linear search of guest CPUID entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_ simply enumerating the same "Model" as the host causes KVM to set the number of LBR records to a non-zero value. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpersSean Christopherson
Turn vcpu_to_lbr_desc() and vcpu_to_lbr_records() into functions in order to provide type safety, to document exactly what they return, and to allow consuming the helpers in vmx.h. Move the definitions as necessary (the macros "reference" to_vmx() before its definition). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-09Merge tag 'x86_bugs_pbrsb' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 eIBRS fixes from Borislav Petkov: "More from the CPU vulnerability nightmares front: Intel eIBRS machines do not sufficiently mitigate against RET mispredictions when doing a VM Exit therefore an additional RSB, one-entry stuffing is needed" * tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Add LFENCE to RSB fill sequence x86/speculation: Add RSB VM Exit protections
2022-08-03x86/speculation: Add RSB VM Exit protectionsDaniel Sneddon
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as documented for RET instructions after VM exits. Mitigate it with a new one-entry RSB stuffing mechanism and a new LFENCE. == Background == Indirect Branch Restricted Speculation (IBRS) was designed to help mitigate Branch Target Injection and Speculative Store Bypass, i.e. Spectre, attacks. IBRS prevents software run in less privileged modes from affecting branch prediction in more privileged modes. IBRS requires the MSR to be written on every privilege level change. To overcome some of the performance issues of IBRS, Enhanced IBRS was introduced. eIBRS is an "always on" IBRS, in other words, just turn it on once instead of writing the MSR on every privilege level change. When eIBRS is enabled, more privileged modes should be protected from less privileged modes, including protecting VMMs from guests. == Problem == Here's a simplification of how guests are run on Linux' KVM: void run_kvm_guest(void) { // Prepare to run guest VMRESUME(); // Clean up after guest runs } The execution flow for that would look something like this to the processor: 1. Host-side: call run_kvm_guest() 2. Host-side: VMRESUME 3. Guest runs, does "CALL guest_function" 4. VM exit, host runs again 5. Host might make some "cleanup" function calls 6. Host-side: RET from run_kvm_guest() Now, when back on the host, there are a couple of possible scenarios of post-guest activity the host needs to do before executing host code: * on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not touched and Linux has to do a 32-entry stuffing. * on eIBRS hardware, VM exit with IBRS enabled, or restoring the host IBRS=1 shortly after VM exit, has a documented side effect of flushing the RSB except in this PBRSB situation where the software needs to stuff the last RSB entry "by hand". IOW, with eIBRS supported, host RET instructions should no longer be influenced by guest behavior after the host retires a single CALL instruction. However, if the RET instructions are "unbalanced" with CALLs after a VM exit as is the RET in #6, it might speculatively use the address for the instruction after the CALL in #3 as an RSB prediction. This is a problem since the (untrusted) guest controls this address. Balanced CALL/RET instruction pairs such as in step #5 are not affected. == Solution == The PBRSB issue affects a wide variety of Intel processors which support eIBRS. But not all of them need mitigation. Today, X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e., eIBRS systems which enable legacy IBRS explicitly. However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT and most of them need a new mitigation. Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT. The lighter-weight mitigation performs a CALL instruction which is immediately followed by a speculative execution barrier (INT3). This steers speculative execution to the barrier -- just like a retpoline -- which ensures that speculation can never reach an unbalanced RET. Then, ensure this CALL is retired before continuing execution with an LFENCE. In other words, the window of exposure is opened at VM exit where RET behavior is troublesome. While the window is open, force RSB predictions sampling for RET targets to a dead end at the INT3. Close the window with the LFENCE. There is a subset of eIBRS systems which are not vulnerable to PBRSB. Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB. Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO. [ bp: Massage, incorporate review comments from Andy Cooper. ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-08-01Merge remote-tracking branch 'kvm/next' into kvm-next-5.20Paolo Bonzini
KVM/s390, KVM/x86 and common infrastructure changes for 5.20 x86: * Permit guests to ignore single-bit ECC errors * Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit (detect microarchitectural hangs) for Intel * Cleanups for MCE MSR emulation s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to use TAP interface * enable interpretive execution of zPCI instructions (for PCI passthrough) * First part of deferred teardown * CPU Topology * PV attestation * Minor fixes Generic: * new selftests API using struct kvm_vcpu instead of a (vm, id) tuple x86: * Use try_cmpxchg64 instead of cmpxchg64 * Bugfixes * Ignore benign host accesses to PMU MSRs when PMU is disabled * Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior * x86/MMU: Allow NX huge pages to be disabled on a per-vm basis * Port eager page splitting to shadow MMU as well * Enable CMCI capability by default and handle injected UCNA errors * Expose pid of vcpu threads in debugfs * x2AVIC support for AMD * cleanup PIO emulation * Fixes for LLDT/LTR emulation * Don't require refcounted "struct page" to create huge SPTEs x86 cleanups: * Use separate namespaces for guest PTEs and shadow PTEs bitmasks * PIO emulation * Reorganize rmap API, mostly around rmap destruction * Do not workaround very old KVM bugs for L0 that runs with nesting enabled * new selftests API for CPUID
2022-07-28KVM: nVMX: Set UMIP bit CR4_FIXED1 MSR when emulating UMIPSean Christopherson
Make UMIP an "allowed-1" bit CR4_FIXED1 MSR when KVM is emulating UMIP. KVM emulates UMIP for both L1 and L2, and so should enumerate that L2 is allowed to have CR4.UMIP=1. Not setting the bit doesn't immediately break nVMX, as KVM does set/clear the bit in CR4_FIXED1 in response to a guest CPUID update, i.e. KVM will correctly (dis)allow nested VM-Entry based on whether or not UMIP is exposed to L1. That said, KVM should enumerate the bit as being allowed from time zero, e.g. userspace will see the wrong value if the MSR is read before CPUID is written. Fixes: 0367f205a3b7 ("KVM: vmx: add support for emulating UMIP") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control"Paolo Bonzini
This reverts commit 03a8871add95213827e2bea84c12133ae5df952e. Since commit 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control"), KVM has taken ownership of the "load IA32_PERF_GLOBAL_CTRL" VMX entry/exit control bits, trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID supports the architectural PMU (CPUID[EAX=0Ah].EAX[7:0]=1), and clear otherwise. This was a misguided attempt at mimicking what commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled", 2018-10-01) did for MPX. However, that commit was a workaround for another KVM bug and not something that should be imitated. Mucking with the VMX MSRs creates a subtle, difficult to maintain ABI as KVM must ensure that any internal changes, e.g. to how KVM handles _any_ guest CPUID changes, yield the same functional result. Therefore, KVM's policy is to let userspace have full control of the guest vCPU model so long as the host kernel is not at risk. Now that KVM really truly ensures kvm_set_msr() will succeed by loading PERF_GLOBAL_CTRL if and only if it exists, revert KVM's misguided and roundabout behavior. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: make it a pure revert] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it existsSean Christopherson
Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and only if the MSR exists (according to the guest vCPU model). KVM has very misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and attempts to force the nVMX MSR settings to match the vPMU model, i.e. to hide/expose the control based on whether or not the MSR exists from the guest's perspective. KVM's modifications fail to handle the scenario where the vPMU is hidden from the guest _after_ being exposed to the guest, e.g. by userspace doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any KVM_RUN. nested_vmx_pmu_refresh() is called if and only if there's a recognized vPMU, i.e. KVM will leave the bits in the allow state and then ultimately reject the MSR load and WARN. KVM should not force the VMX MSRs in the first place. KVM taking control of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled", 2018-10-01) did for MPX. However, the MPX commit was a workaround for another KVM bug and not something that should be imitated (and it should never been done in the first place). In other words, KVM's ABI _should_ be that userspace has full control over the MSRs, at which point triggering the WARN that loading the MSR must not fail is trivial. The intent of the WARN is still valid; KVM has consistency checks to ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid. The problem is that '0' must be considered a valid value at all times, and so the simple/obvious solution is to just not actually load the MSR when it does not exist. It is userspace's responsibility to provide a sane vCPU model, i.e. KVM is well within its ABI and Intel's VMX architecture to skip the loads if the MSR does not exist. Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRLSean Christopherson
Add a helper to check of the guest PMU has PERF_GLOBAL_CTRL, which is unintuitive _and_ diverges from Intel's architecturally defined behavior. Even worse, KVM currently implements the check using two different (but equivalent) checks, _and_ there has been at least one attempt to add a _third_ flavor. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMUSean Christopherson
Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits as reserved if there is no guest vPMU. The nVMX VM-Entry consistency checks do not check for a valid vPMU prior to consuming the masks via kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the actual MSR load will be rejected by intel_is_valid_msr()). Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"Paolo Bonzini
Since commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"), KVM has taken ownership of the "load IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls, trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID supports MPX, and clear otherwise. The intent of the patch was to apply it to L0 in order to work around L1 kernels that lack the fix in commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST, and the L1 bug is neutralized even in the lack of commit 691bd4340bef. This was perhaps a sensible kludge at the time, but a horrible idea in the long term and in fact it has not been extended to other CPUID bits like these: X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE, VMX_MISC_SAVE_EFER_LMA X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING, SECONDARY_EXEC_TSC_SCALING X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA, VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES These days it's sort of common knowledge that any MSR in KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR to a default value, so it is unlikely that something like commit 5f76f6f5ff96 will be needed again. So revert it, at the potential cost of breaking L1s with a 6 year old kernel. While in principle the L0 owner doesn't control what runs on L1, such an old hypervisor would probably have many other bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported valueSean Christopherson
Restrict the nVMX MSRs based on KVM's config, not based on the guest's current config. Using the guest's config to audit the new config prevents userspace from restoring the original config (KVM's config) if at any point in the past the guest's config was restricted in any way. Fixes: 62cc6b9dc61e ("KVM: nVMX: support restore of VMX capability MSRs") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}()Sean Christopherson
Rename the exit handlers for VMXON and VMXOFF to match the instruction names, the terms "vmon" and "vmoff" are not used anywhere in Intel's documentation, nor are they used elsehwere in KVM. Sadly, the exit reasons are exposed to userspace and so cannot be renamed without breaking userspace. :-( Fixes: ec378aeef9df ("KVM: nVMX: Implement VMXON and VMXOFF") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4Sean Christopherson
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed per the associated nested VMX MSRs' fixed0/1 settings. KVM cannot rely on hardware to perform the checks, even for the few checks that have higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in hardware while running the guest, (b) there may incompatible CR0/CR4 bits that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may have further restricted the allowed CR0/CR4 values by manipulating the guest's nested VMX MSRs. Note, despite a very strong desire to throw shade at Jim, commit 70f3aac964ae ("kvm: nVMX: Remove superfluous VMX instruction fault checks") is not to blame for the buggy behavior (though the comment...). That commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks (though it did erroneously drop the CPL check, but that has already been remedied). KVM may force CR0.PE=1, but will do so only when also forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033 Fixes: ec378aeef9df ("KVM: nVMX: Implement VMXON and VMXOFF") Reported-by: Eric Li <ercli@ucdavis.edu> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>