summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2019-04-18KVM: lapic: Convert guest TSC to host time domain if necessarySean Christopherson
To minimize the latency of timer interrupts as observed by the guest, KVM adjusts the values it programs into the host timers to account for the host's overhead of programming and handling the timer event. In the event that the adjustments are too aggressive, i.e. the timer fires earlier than the guest expects, KVM busy waits immediately prior to entering the guest. Currently, KVM manually converts the delay from nanoseconds to clock cycles. But, the conversion is done in the guest's time domain, while the delay occurs in the host's time domain. This is perfectly ok when the guest and host are using the same TSC ratio, but if the guest is using a different ratio then the delay may not be accurate and could wait too little or too long. When the guest is not using the host's ratio, convert the delay from guest clock cycles to host nanoseconds and use ndelay() instead of __delay() to provide more accurate timing. Because converting to nanoseconds is relatively expensive, e.g. requires division and more multiplication ops, continue using __delay() directly when guest and host TSCs are running at the same ratio. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-18KVM: lapic: Allow user to disable adaptive tuning of timer advancementSean Christopherson
The introduction of adaptive tuning of lapic timer advancement did not allow for the scenario where userspace would want to disable adaptive tuning but still employ timer advancement, e.g. for testing purposes or to handle a use case where adaptive tuning is unable to settle on a suitable time. This is epecially pertinent now that KVM places a hard threshold on the maximum advancment time. Rework the timer semantics to accept signed values, with a value of '-1' being interpreted as "use adaptive tuning with KVM's internal default", and any other value being used as an explicit advancement time, e.g. a time of '0' effectively disables advancement. Note, this does not completely restore the original behavior of lapic_timer_advance_ns. Prior to tracking the advancement per vCPU, which is necessary to support autotuning, userspace could adjust lapic_timer_advance_ns for *running* vCPU. With per-vCPU tracking, the module params are snapshotted at vCPU creation, i.e. applying a new advancement effectively requires restarting a VM. Dynamically updating a running vCPU is possible, e.g. a helper could be added to retrieve the desired delay, choosing between the global module param and the per-VCPU value depending on whether or not auto-tuning is (globally) enabled, but introduces a great deal of complexity. The wrapper itself is not complex, but understanding and documenting the effects of dynamically toggling auto-tuning and/or adjusting the timer advancement is nigh impossible since the behavior would be dependent on KVM's implementation as well as compiler optimizations. In other words, providing stable behavior would require extremely careful consideration now and in the future. Given that the expected use of a manually-tuned timer advancement is to "tune once, run many", use the vastly simpler approach of recognizing changes to the module params only when creating a new vCPU. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-18KVM: lapic: Track lapic timer advance per vCPUSean Christopherson
Automatically adjusting the globally-shared timer advancement could corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting the advancement value. That could be partially fixed by using a local variable for the arithmetic, but it would still be susceptible to a race when setting timer_advance_adjust_done. And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the correct calibration for a given vCPU may not apply to all vCPUs. Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is effectively violated when finding a stable advancement takes an extended amount of timer. Opportunistically change the definition of lapic_timer_advance_ns to a u32 so that it matches the style of struct kvm_timer. Explicitly pass the param to kvm_create_lapic() so that it doesn't have to be exposed to lapic.c, thus reducing the probability of unintentionally using the global value instead of the per-vCPU value. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-18KVM: lapic: Disable timer advancement if adaptive tuning goes haywireSean Christopherson
To minimize the latency of timer interrupts as observed by the guest, KVM adjusts the values it programs into the host timers to account for the host's overhead of programming and handling the timer event. Now that the timer advancement is automatically tuned during runtime, it's effectively unbounded by default, e.g. if KVM is running as L1 the advancement can measure in hundreds of milliseconds. Disable timer advancement if adaptive tuning yields an advancement of more than 5000ns, as large advancements can break reasonable assumptions of the guest, e.g. that a timer configured to fire after 1ms won't arrive on the next instruction. Although KVM busy waits to mitigate the case of a timer event arriving too early, complications can arise when shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test will fail when its "host" exits on interrupts as KVM may inject the INTR before the guest executes STI+HLT. Arguably the unit test is "broken" in the sense that delaying a timer interrupt by 1ms doesn't technically guarantee the interrupt will arrive after STI+HLT, but it's a reasonable assumption that KVM should support. Furthermore, an unbounded advancement also effectively unbounds the time spent busy waiting, e.g. if the guest programs a timer with a very large delay. 5000ns is a somewhat arbitrary threshold. When running on bare metal, which is the intended use case, timer advancement is expected to be in the general vicinity of 1000ns. 5000ns is high enough that false positives are unlikely, while not being so high as to negatively affect the host's performance/stability. Note, a future patch will enable userspace to disable KVM's adaptive tuning, which will allow priveleged userspace will to specifying an advancement value in excess of this arbitrary threshold in order to satisfy an abnormal use case. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-18x86: kvm: hyper-v: deal with buggy TLB flush requests from WS2012Vitaly Kuznetsov
It was reported that with some special Multi Processor Group configuration, e.g: bcdedit.exe /set groupsize 1 bcdedit.exe /set maxgroup on bcdedit.exe /set groupaware on for a 16-vCPU guest WS2012 shows BSOD on boot when PV TLB flush mechanism is in use. Tracing kvm_hv_flush_tlb immediately reveals the issue: kvm_hv_flush_tlb: processor_mask 0x0 address_space 0x0 flags 0x2 The only flag set in this request is HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES, however, processor_mask is 0x0 and no HV_FLUSH_ALL_PROCESSORS is specified. We don't flush anything and apparently it's not what Windows expects. TLFS doesn't say anything about such requests and newer Windows versions seem to be unaffected. This all feels like a WS2012 bug, which is, however, easy to workaround in KVM: let's flush everything when we see an empty flush request, over-flushing doesn't hurt. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-18KVM: x86: Consider LAPIC TSC-Deadline timer expired if deadline too shortLiran Alon
If guest sets MSR_IA32_TSCDEADLINE to value such that in host time-domain it's shorter than lapic_timer_advance_ns, we can reach a case that we call hrtimer_start() with expiration time set at the past. Because lapic_timer.timer is init with HRTIMER_MODE_ABS_PINNED, it is not allowed to run in softirq and therefore will never expire. To avoid such a scenario, verify that deadline expiration time is set on host time-domain further than (now + lapic_timer_advance_ns). A future patch can also consider adding a min_timer_deadline_ns module parameter, similar to min_timer_period_us to avoid races that amount of ns it takes to run logic could still call hrtimer_start() with expiration timer set at the past. Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16kvm: move KVM_CAP_NR_MEMSLOTS to common codePaolo Bonzini
All architectures except MIPS were defining it in the same way, and memory slots are handled entirely by common code so there is no point in keeping the definition per-architecture. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Inject #GP if guest attempts to set unsupported EFER bitsSean Christopherson
EFER.LME and EFER.NX are considered reserved if their respective feature bits are not advertised to the guest. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writesSean Christopherson
KVM allows userspace to violate consistency checks related to the guest's CPUID model to some degree. Generally speaking, userspace has carte blanche when it comes to guest state so long as jamming invalid state won't negatively affect the host. Currently this is seems to be a non-issue as most of the interesting EFER checks are missing, e.g. NX and LME, but those will be added shortly. Proactively exempt userspace from the CPUID checks so as not to break userspace. Note, the efer_reserved_bits check still applies to userspace writes as that mask reflects the host's capabilities, e.g. KVM shouldn't allow a guest to run with NX=1 if it has been disabled in the host. Fixes: d80174745ba39 ("KVM: SVM: Only allow setting of EFER_SVME when CPUID SVM is set") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: Return -EINVAL when signaling failure in VM-Entry helpersSean Christopherson
Most, but not all, helpers that are related to emulating consistency checks for nested VM-Entry return -EINVAL when a check fails. Convert the holdouts to have consistency throughout and to make it clear that the functions are signaling pass/fail as opposed to "resume guest" vs. "exit to userspace". Opportunistically fix bad indentation in nested_vmx_check_guest_state(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: Return -EINVAL when signaling failure in pre-VM-Entry helpersPaolo Bonzini
Convert all top-level nested VM-Enter consistency check functions to return 0/-EINVAL instead of failure codes, since now they can only ever return one failure code. This also does not give the false impression that failure information is always consumed and/or relevant, e.g. vmx_set_nested_state() only cares whether or not the checks were successful. nested_check_host_control_regs() can also now be inlined into its caller, nested_vmx_check_host_state, since the two have effectively become the same function. Based on a patch by Sean Christopherson. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: Rename and split top-level consistency checks to match SDMSean Christopherson
Rename the top-level consistency check functions to (loosely) align with the SDM. Historically, KVM has used the terms "prereq" and "postreq" to differentiate between consistency checks that lead to VM-Fail and those that lead to VM-Exit. The terms are vague and potentially misleading, e.g. "postreq" might be interpreted as occurring after VM-Entry. Note, while the SDM lumps controls and host state into a single section, "Checks on VMX Controls and Host-State Area", split them into separate top-level functions as the two categories of checks result in different VM instruction errors. This split will allow for additional cleanup. Note #2, "vmentry" is intentionally dropped from the new function names to avoid confusion with nested_check_vm_entry_controls(), and to keep the length of the functions names somewhat manageable. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: Move guest non-reg state checks to VM-Exit pathSean Christopherson
Per Intel's SDM, volume 3, section Checking and Loading Guest State: Because the checking and the loading occur concurrently, a failure may be discovered only after some state has been loaded. For this reason, the logical processor responds to such failures by loading state from the host-state area, as it would for a VM exit. In other words, a failed non-register state consistency check results in a VM-Exit, not VM-Fail. Moving the non-reg state checks also paves the way for renaming nested_vmx_check_vmentry_postreqs() to align with the SDM, i.e. nested_vmx_check_vmentry_guest_state(). Fixes: 26539bd0e446a ("KVM: nVMX: check vmcs12 for valid activity state") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16kvm: nVMX: Check "load IA32_PAT" VM-entry control on vmentryKrish Sadhukhan
According to section "Checking and Loading Guest State" in Intel SDM vol 3C, the following check is performed on vmentry: If the "load IA32_PAT" VM-entry control is 1, the value of the field for the IA32_PAT MSR must be one that could be written by WRMSR without fault at CPL 0. Specifically, each of the 8 bytes in the field must have one of the values 0 (UC), 1 (WC), 4 (WT), 5 (WP), 6 (WB), or 7 (UC-). Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16kvm: nVMX: Check "load IA32_PAT" VM-exit control on vmentryKrish Sadhukhan
According to section "Checks on Host Control Registers and MSRs" in Intel SDM vol 3C, the following check is performed on vmentry: If the "load IA32_PAT" VM-exit control is 1, the value of the field for the IA32_PAT MSR must be one that could be written by WRMSR without fault at CPL 0. Specifically, each of the 8 bytes in the field must have one of the values 0 (UC), 1 (WC), 4 (WT), 5 (WP), 6 (WB), or 7 (UC-). Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: optimize check for valid PAT valuePaolo Bonzini
This check will soon be done on every nested vmentry and vmexit, "parallelize" it using bitwise operations. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: clear VM_EXIT_SAVE_IA32_PATPaolo Bonzini
This is not needed, PAT writes always take an MSR vmexit. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: vmx: print more APICv fields in dump_vmcsPaolo Bonzini
The SVI, RVI, virtual-APIC page address and APIC-access page address fields were left out of dump_vmcs. Add them. KERN_CONT technically isn't SMP safe, but it's okay to use it here since the whole of dump_vmcs() is a single huge multi-line piece of output that isn't SMP-safe. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: avoid misreporting level-triggered irqs as edge-triggered in tracingVitaly Kuznetsov
In __apic_accept_irq() interface trig_mode is int and actually on some code paths it is set above u8: kvm_apic_set_irq() extracts it from 'struct kvm_lapic_irq' where trig_mode is u16. This is done on purpose as e.g. kvm_set_msi_irq() sets it to (1 << 15) & e->msi.data kvm_apic_local_deliver sets it to reg & (1 << 15). Fix the immediate issue by making 'tm' into u16. We may also want to adjust __apic_accept_irq() interface and use proper sizes for vector, level, trig_mode but this is not urgent. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: fix spectrev1 gadgetsPaolo Bonzini
These were found with smatch, and then generalized when applicable. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: fix warning Using plain integer as NULL pointerHariprasad Kelam
Changed passing argument as "0 to NULL" which resolves below sparse warning arch/x86/kvm/x86.c:3096:61: warning: Using plain integer as NULL pointer Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernelsSean Christopherson
Invoking the 64-bit variation on a 32-bit kenrel will crash the guest, trigger a WARN, and/or lead to a buffer overrun in the host, e.g. rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64. KVM allows userspace to report long mode support via CPUID, even though the guest is all but guaranteed to crash if it actually tries to enable long mode. But, a pure 32-bit guest that is ignorant of long mode will happily plod along. SMM complicates things as 64-bit CPUs use a different SMRAM save state area. KVM handles this correctly for 64-bit kernels, e.g. uses the legacy save state map if userspace has hid long mode from the guest, but doesn't fare well when userspace reports long mode support on a 32-bit host kernel (32-bit KVM doesn't support 64-bit guests). Since the alternative is to crash the guest, e.g. by not loading state or explicitly requesting shutdown, unconditionally use the legacy SMRAM save state map for 32-bit KVM. If a guest has managed to get far enough to handle SMIs when running under a weird/buggy userspace hypervisor, then don't deliberately crash the guest since there are no downsides (from KVM's perspective) to allow it to continue running. Fixes: 660a5d517aaab ("KVM: x86: save/load state on SMM switch") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPUSean Christopherson
Neither AMD nor Intel CPUs have an EFER field in the legacy SMRAM save state area, i.e. don't save/restore EFER across SMM transitions. KVM somewhat models this, e.g. doesn't clear EFER on entry to SMM if the guest doesn't support long mode. But during RSM, KVM unconditionally clears EFER so that it can get back to pure 32-bit mode in order to start loading CRs with their actual non-SMM values. Clear EFER only when it will be written when loading the non-SMM state so as to preserve bits that can theoretically be set on 32-bit vCPUs, e.g. KVM always emulates EFER_SCE. And because CR4.PAE is cleared only to play nice with EFER, wrap that code in the long mode check as well. Note, this may result in a compiler warning about cr4 being consumed uninitialized. Re-read CR4 even though it's technically unnecessary, as doing so allows for more readable code and RSM emulation is not a performance critical path. Fixes: 660a5d517aaab ("KVM: x86: save/load state on SMM switch") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: clear SMM flags before loading state while leaving SMMSean Christopherson
RSM emulation is currently broken on VMX when the interrupted guest has CR4.VMXE=1. Stop dancing around the issue of HF_SMM_MASK being set when loading SMSTATE into architectural state, e.g. by toggling it for problematic flows, and simply clear HF_SMM_MASK prior to loading architectural state (from SMRAM save state area). Reported-by: Jon Doron <arilou@gmail.com> Cc: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: 5bea5123cbf0 ("KVM: VMX: check nested state and CR4.VMXE against SMM") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Open code kvm_set_hflagsSean Christopherson
Prepare for clearing HF_SMM_MASK prior to loading state from the SMRAM save state map, i.e. kvm_smm_changed() needs to be called after state has been loaded and so cannot be done automatically when setting hflags from RSM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Load SMRAM in a single shot when leaving SMMSean Christopherson
RSM emulation is currently broken on VMX when the interrupted guest has CR4.VMXE=1. Rather than dance around the issue of HF_SMM_MASK being set when loading SMSTATE into architectural state, ideally RSM emulation itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM architectural state. Ostensibly, the only motivation for having HF_SMM_MASK set throughout the loading of state from the SMRAM save state area is so that the memory accesses from GET_SMSTATE() are tagged with role.smm. Load all of the SMRAM save state area from guest memory at the beginning of RSM emulation, and load state from the buffer instead of reading guest memory one-by-one. This paves the way for clearing HF_SMM_MASK prior to loading state, and also aligns RSM with the enter_smm() behavior, which fills a buffer and writes SMRAM save state in a single go. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: Expose RDPMC-exiting only when guest supports PMULiran Alon
Issue was discovered when running kvm-unit-tests on KVM running as L1 on top of Hyper-V. When vmx_instruction_intercept unit-test attempts to run RDPMC to test RDPMC-exiting, it is intercepted by L1 KVM which it's EXIT_REASON_RDPMC handler raise #GP because vCPU exposed by Hyper-V doesn't support PMU. Instead of unit-test expectation to be reflected with EXIT_REASON_RDPMC. The reason vmx_instruction_intercept unit-test attempts to run RDPMC even though Hyper-V doesn't support PMU is because L1 expose to L2 support for RDPMC-exiting. Which is reasonable to assume that is supported only in case CPU supports PMU to being with. Above issue can easily be simulated by modifying vmx_instruction_intercept config in x86/unittests.cfg to run QEMU with "-cpu host,+vmx,-pmu" and run unit-test. To handle issue, change KVM to expose RDPMC-exiting only when guest supports PMU. Reported-by: Saar Amar <saaramar@microsoft.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: Raise #GP when guest vCPU do not support PMULiran Alon
Before this change, reading a VMware pseduo PMC will succeed even when PMU is not supported by guest. This can easily be seen by running kvm-unit-test vmware_backdoors with "-cpu host,-pmu" option. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16x86/kvm: move kvm_load/put_guest_xcr0 into atomic contextWANG Chao
guest xcr0 could leak into host when MCE happens in guest mode. Because do_machine_check() could schedule out at a few places. For example: kvm_load_guest_xcr0 ... kvm_x86_ops->run(vcpu) { vmx_vcpu_run vmx_complete_atomic_exit kvm_machine_check do_machine_check do_memory_failure memory_failure lock_page In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule out, host cpu has guest xcr0 loaded (0xff). In __switch_to { switch_fpu_finish copy_kernel_to_fpregs XRSTORS If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in and tries to reinitialize fpu by restoring init fpu state. Same story as last #GP, except we get DOUBLE FAULT this time. Cc: stable@vger.kernel.org Signed-off-by: WANG Chao <chao.wang@ucloud.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: x86: svm: make sure NMI is injected after nmi_singlestepVitaly Kuznetsov
I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P, the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing shows that we're sometimes able to deliver a few but never all. When we're trying to inject an NMI we may fail to do so immediately for various reasons, however, we still need to inject it so enable_nmi_window() arms nmi_singlestep mode. #DB occurs as expected, but we're not checking for pending NMIs before entering the guest and unless there's a different event to process, the NMI will never get delivered. Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure pending NMIs are checked and possibly injected. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16svm/avic: Fix invalidate logical APIC id entrySuthikulpanit, Suravee
Only clear the valid bit when invalidate logical APIC id entry. The current logic clear the valid bit, but also set the rest of the bits (including reserved bits) to 1. Fixes: 98d90582be2e ('svm: Fix AVIC DFR and LDR handling') Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16Revert "svm: Fix AVIC incomplete IPI emulation"Suthikulpanit, Suravee
This reverts commit bb218fbcfaaa3b115d4cd7a43c0ca164f3a96e57. As Oren Twaig pointed out the old discussion: https://patchwork.kernel.org/patch/8292231/ that the change coud potentially cause an extra IPI to be sent to the destination vcpu because the AVIC hardware already set the IRR bit before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running). Since writting to ICR and ICR2 will also set the IRR. If something triggers the destination vcpu to get scheduled before the emulation finishes, then this could result in an additional IPI. Also, the issue mentioned in the commit bb218fbcfaaa was misdiagnosed. Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Oren Twaig <oren@scalemp.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16kvm: mmu: Fix overflow on kvm mmu page limit calculationBen Gardon
KVM bases its memory usage limits on the total number of guest pages across all memslots. However, those limits, and the calculations to produce them, use 32 bit unsigned integers. This can result in overflow if a VM has more guest pages that can be represented by a u32. As a result of this overflow, KVM can use a low limit on the number of MMU pages it will allocate. This makes KVM unable to map all of guest memory at once, prompting spurious faults. Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch introduced no new failures. Signed-off-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: always use early vmcs check when EPT is disabledPaolo Bonzini
The remaining failures of vmx.flat when EPT is disabled are caused by incorrectly reflecting VMfails to the L1 hypervisor. What happens is that nested_vmx_restore_host_state corrupts the guest CR3, reloading it with the host's shadow CR3 instead, because it blindly loads GUEST_CR3 from the vmcs01. For simplicity let's just always use hardware VMCS checks when EPT is disabled. This way, nested_vmx_restore_host_state is not reached at all (or at least shouldn't be reached). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16KVM: nVMX: allow tests to use bad virtual-APIC page addressPaolo Bonzini
As mentioned in the comment, there are some special cases where we can simply clear the TPR shadow bit from the CPU-based execution controls in the vmcs02. Handle them so that we can remove some XFAILs from vmx.flat. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-15KVM: x86/mmu: Fix an inverted list_empty() check when zapping sptesSean Christopherson
A recently introduced helper for handling zap vs. remote flush incorrectly bails early, effectively leaking defunct shadow pages. Manifests as a slab BUG when exiting KVM due to the shadow pages being alive when their associated cache is destroyed. ========================================================================== BUG kvm_mmu_page_header: Objects remaining in kvm_mmu_page_header on ... -------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Slab 0x00000000fc436387 objects=26 used=23 fp=0x00000000d023caee ... CPU: 6 PID: 4315 Comm: rmmod Tainted: G B 5.1.0-rc2+ #19 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x46/0x5b slab_err+0xad/0xd0 ? on_each_cpu_mask+0x3c/0x50 ? ksm_migrate_page+0x60/0x60 ? on_each_cpu_cond_mask+0x7c/0xa0 ? __kmalloc+0x1ca/0x1e0 __kmem_cache_shutdown+0x13a/0x310 shutdown_cache+0xf/0x130 kmem_cache_destroy+0x1d5/0x200 kvm_mmu_module_exit+0xa/0x30 [kvm] kvm_arch_exit+0x45/0x60 [kvm] kvm_exit+0x6f/0x80 [kvm] vmx_exit+0x1a/0x50 [kvm_intel] __x64_sys_delete_module+0x153/0x1f0 ? exit_to_usermode_loop+0x88/0xc0 do_syscall_64+0x4f/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: a21136345cb6f ("KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-12x86/fpu: Defer FPU state load until return to userspaceRik van Riel
Defer loading of FPU state until return to userspace. This gives the kernel the potential to skip loading FPU state for tasks that stay in kernel mode, or for tasks that end up with repeated invocations of kernel_fpu_begin() & kernel_fpu_end(). The fpregs_lock/unlock() section ensures that the registers remain unchanged. Otherwise a context switch or a bottom half could save the registers to its FPU context and the processor's FPU registers would became random if modified at the same time. KVM swaps the host/guest registers on entry/exit path. This flow has been kept as is. First it ensures that the registers are loaded and then saves the current (host) state before it loads the guest's registers. The swap is done at the very end with disabled interrupts so it should not change anymore before theg guest is entered. The read/save version seems to be cheaper compared to memcpy() in a micro benchmark. Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy(). For kernel threads, this flag gets never cleared which avoids saving / restoring the FPU state for kernel threads and during in-kernel usage of the FPU registers. [ bp: Correct and update commit message and fix checkpatch warnings. s/register/registers/ where it is used in plural. minor comment corrections. remove unused trace_x86_fpu_activate_state() TP. ] Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Babu Moger <Babu.Moger@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dmitry Safonov <dima@arista.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Nicolai Stange <nstange@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Waiman Long <longman@redhat.com> Cc: x86-ml <x86@kernel.org> Cc: Yi Wang <wang.yi59@zte.com.cn> Link: https://lkml.kernel.org/r/20190403164156.19645-24-bigeasy@linutronix.de
2019-04-11x86/pkeys: Provide *pkru() helpersSebastian Andrzej Siewior
Dave Hansen asked for __read_pkru() and __write_pkru() to be symmetrical. As part of the series __write_pkru() will read back the value and only write it if it is different. In order to make both functions symmetrical, move the function containing only the opcode asm into a function called like the instruction itself. __write_pkru() will just invoke wrpkru() but in a follow-up patch will also read back the value. [ bp: Convert asm opcode wrapper names to rd/wrpkru(). ] Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190403164156.19645-13-bigeasy@linutronix.de
2019-04-10x86/fpu: Use a feature number instead of mask in two more helpersSebastian Andrzej Siewior
After changing the argument of __raw_xsave_addr() from a mask to number Dave suggested to check if it makes sense to do the same for get_xsave_addr(). As it turns out it does. Only get_xsave_addr() needs the mask to check if the requested feature is part of what is supported/saved and then uses the number again. The shift operation is cheaper compared to fls64() (find last bit set). Also, the feature number uses less opcode space compared to the mask. :) Make the get_xsave_addr() argument a xfeature number instead of a mask and fix up its callers. Furthermore, use xfeature_nr and xfeature_mask consistently. This results in the following changes to the kvm code: feature -> xfeature_mask index -> xfeature_nr Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190403164156.19645-12-bigeasy@linutronix.de
2019-04-05KVM: x86: nVMX: fix x2APIC VTPR read interceptMarc Orr
Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the SDM, when "virtualize x2APIC mode" is 1 and "APIC-register virtualization" is 0, a RDMSR of 808H should return the VTPR from the virtual APIC page. However, for nested, KVM currently fails to disable the read intercept for this MSR. This means that a RDMSR exit takes precedence over "virtualize x2APIC mode", and KVM passes through L1's TPR to L2, instead of sourcing the value from L2's virtual APIC page. This patch fixes the issue by disabling the read intercept, in VMCS02, for the VTPR when "APIC-register virtualization" is 0. The issue described above and fix prescribed here, were verified with a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC mode w/ nested". Signed-off-by: Marc Orr <marcorr@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Fixes: c992384bde84f ("KVM: vmx: speed up MSR bitmap merge") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887)Marc Orr
The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the x2APIC MSR intercepts with the "virtualize x2APIC mode" MSR. As a result, we discovered the potential for a buggy or malicious L1 to get access to L0's x2APIC MSRs, via an L2, as follows. 1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl variable, in nested_vmx_prepare_msr_bitmap() to become true. 2. L1 disables "virtualize x2APIC mode" in VMCS12. 3. L1 enables "APIC-register virtualization" in VMCS12. Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then set "virtualize x2APIC mode" to 0 in VMCS02. Oops. This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR intercepts with VMCS12's "virtualize x2APIC mode" control. The scenario outlined above and fix prescribed here, were verified with a related patch in kvm-unit-tests titled "Add leak scenario to virt_x2apic_mode_test". Note, it looks like this issue may have been introduced inadvertently during a merge---see 15303ba5d1cd. Signed-off-by: Marc Orr <marcorr@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflowDavid Rientjes
This ensures that the address and length provided to DBG_DECRYPT and DBG_ENCRYPT do not cause an overflow. At the same time, pass the actual number of pages pinned in memory to sev_unpin_memory() as a cleanup. Reported-by: Cfir Cohen <cfir@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05kvm: svm: fix potential get_num_contig_pages overflowDavid Rientjes
get_num_contig_pages() could potentially overflow int so make its type consistent with its usage. Reported-by: Cfir Cohen <cfir@google.com> Cc: stable@vger.kernel.org Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28KVM: x86: update %rip after emulating IOSean Christopherson
Most (all?) x86 platforms provide a port IO based reset mechanism, e.g. OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM that it is doing a reset, e.g. Qemu jams vCPU state and resumes running. To avoid corruping %rip after such a reset, commit 0967b7bf1c22 ("KVM: Skip pio instruction when it is emulated, not executed") changed the behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the instruction prior to exiting to userspace. Full emulation doesn't need such tricks becase re-emulating the instruction will naturally handle %rip being changed to point at the reset vector. Updating %rip prior to executing to userspace has several drawbacks: - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation fails it will likely yell about the wrong address. - Single step exits to userspace for are effectively dropped as KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO. - Behavior of PIO emulation is different depending on whether it goes down the fast path or the slow path. Rather than skip the PIO instruction before exiting to userspace, snapshot the linear %rip and cancel PIO completion if the current value does not match the snapshot. For a 64-bit vCPU, i.e. the most common scenario, the snapshot and comparison has negligible overhead as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra VMREAD in this case. All other alternatives to snapshotting the linear %rip that don't rely on an explicit reset announcenment suffer from one corner case or another. For example, canceling PIO completion on any write to %rip fails if userspace does a save/restore of %rip, and attempting to avoid that issue by canceling PIO only if %rip changed then fails if PIO collides with the reset %rip. Attempting to zero in on the exact reset vector won't work for APs, which means adding more hooks such as the vCPU's MP_STATE, and so on and so forth. Checking for a linear %rip match technically suffers from corner cases, e.g. userspace could theoretically rewrite the underlying code page and expect a different instruction to execute, or the guest hardcodes a PIO reset at 0xfffffff0, but those are far, far outside of what can be considered normal operation. Fixes: 432baf60eee3 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O") Cc: <stable@vger.kernel.org> Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28x86/kvm/hyper-v: avoid spurious pending stimer on vCPU initVitaly Kuznetsov
When userspace initializes guest vCPUs it may want to zero all supported MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/ HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") we began doing stimer_mark_pending() unconditionally on every config change. The issue I'm observing manifests itself as following: - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER; - kvm_hv_has_stimer_pending() starts returning true; - kvm_vcpu_has_events() starts returning true; - kvm_arch_vcpu_runnable() starts returning true; - when kvm_arch_vcpu_ioctl_run() gets into (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case: - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns immediately, avoiding normal wait path; - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing userspace to retry. So instead of normal wait path we get a busy loop on all secondary vCPUs before they get INIT signal. This seems to be undesirable, especially given that this happens even when Hyper-V extensions are not used. Generally, it seems to be pointless to mark an stimer as pending in stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing kvm_hv_process_stimers() will do is clear the corresponding bit. We may just not mark disabled timers as pending instead. Fixes: f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrsXiaoyao Li
Since MSR_IA32_ARCH_CAPABILITIES is emualted unconditionally even if host doesn't suppot it. We should move it to array emulated_msrs from arry msrs_to_save, to report to userspace that guest support this msr. Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hostsSean Christopherson
The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES regardless of hardware support under the pretense that KVM fully emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts). Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so that it's emulated on AMD hosts. Fixes: 1eaafe91a0df4 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported") Cc: stable@vger.kernel.org Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28kvm: mmu: Used range based flushing in slot_handle_level_rangeBen Gardon
Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address in slot_handle_level_range. When range based flushes are not enabled kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs. This changes the behavior of many functions that indirectly use slot_handle_level_range, iff the range based flushes are enabled. The only potential problem I see with this is that kvm->tlbs_dirty will be cleared less often, however the only caller of slot_handle_level_range that checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which checks it and does a kvm_flush_remote_tlbs after calling kvm_unmap_hva_range anyway. Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and without this patch. The patch introduced no new failures. Signed-off-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()Wei Yang
* nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is non-zero. * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages() never return zero. Based on these two reasons, we can merge the two *if* clause and use the return value from kvm_mmu_calculate_mmu_pages() directly. This simplify the code and also eliminate the possibility for reader to believe nr_mmu_pages would be zero. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP ↵Krish Sadhukhan
fields According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check is performed on vmentry of L2 guests: On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field must each contain a canonical address. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>