path: root/arch/x86/kvm/vmx
Age | Commit message | Author
2020-09-28 | KVM: VMX: Rename vmx_*_supported() helpers to cpu_has_vmx_*() | Sean Christopherson
Rename helpers for a few controls to conform to the more prevalent style of cpu_has_vmx_<feature>(). Consistent names will allow adding macros to consolidate the boilerplate code for adjusting secondary execution controls. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923165048.20486-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
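For reference, the target style matches the existing helpers in capabilities.h; a sketch of the shape (quoted from memory, not the verbatim file):

  static inline bool cpu_has_vmx_waitpkg(void)
  {
          return vmcs_config.cpu_based_2nd_exec_ctrl &
                  SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
  }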
2020-09-28 | KVM: VMX: Use "illegal GPA" helper for PT/RTIT output base check | Sean Christopherson
Use kvm_vcpu_is_illegal_gpa() to check for a legal GPA when validating a PT output base instead of open coding a clever, but difficult to read, variant. Code readability is far more important than shaving a few uops in a slow path. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200924194250.19137-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Move illegal GPA helper out of the MMU code | Sean Christopherson
Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it to cpuid.h so that it's colocated with cpuid_maxphyaddr(). The helper is not MMU specific and will gain a user that is completely unrelated to the MMU in a future patch. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200924194250.19137-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
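A plausible shape for the moved helper, based on the description above (a sketch, not the verbatim diff):

  static inline bool kvm_vcpu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
  {
          /* Any GPA at or above the guest's MAXPHYADDR is illegal. */
          return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
  }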
2020-09-28 | KVM: VMX: Replace MSR_IA32_RTIT_OUTPUT_BASE_MASK with helper function | Sean Christopherson
Replace the subtly not-a-constant MSR_IA32_RTIT_OUTPUT_BASE_MASK with a proper helper function to check whether or not the specified base is valid. Blindly referencing the local 'vcpu' is especially nasty. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200924194250.19137-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
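Combined with the rename above, the helper plausibly ends up as follows (a sketch; the name pt_output_base_valid is assumed from the surrounding series):

  static bool pt_output_base_valid(struct kvm_vcpu *vcpu, u64 base)
  {
          /* The base must be 128-byte aligned and a legal physical address. */
          return !(base & 0x7f) && !kvm_vcpu_is_illegal_gpa(vcpu, base);
  }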
2020-09-28 | KVM: VMX: Use precomputed MAXPHYADDR for RTIT base MSR check | Sean Christopherson
Use cpuid_maxphyaddr() instead of cpuid_query_maxphyaddr() for the RTIT base MSR check. There is no reason to recompute MAXPHYADDR as the precomputed version is synchronized with CPUID updates, and MSR_IA32_RTIT_OUTPUT_BASE is not written between stuffing CPUID and refreshing vcpu->arch.maxphyaddr. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200924194250.19137-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: VMX: Do not perform emulation for INVD intercept | Tom Lendacky
The INVD instruction is emulated as a NOP; just skip the instruction instead. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <addd41be2fbf50f5f4059e990a2a0cff182d2136.1600972918.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
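On the VMX side the intercept then collapses to a skip; a sketch of the described change:

  static int handle_invd(struct kvm_vcpu *vcpu)
  {
          /* Treat INVD as a NOP and skip over it; no emulation needed. */
          return kvm_skip_emulated_instruction(vcpu);
  }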
2020-09-28 | KVM: nVMX: Add VM-Enter failed tracepoints for super early checks | Sean Christopherson
Add tracepoints for the early consistency checks in nested_vmx_run(). The "VMLAUNCH vs. VMRESUME" check in particular is useful to trace, as there is no architectural way to check VMCS.LAUNCH_STATE, and subtle bugs such as VMCLEAR on the wrong HPA can lead to confusing errors in the L1 VMM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200812180615.22372-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: nVMX: Morph notification vector IRQ on nested VM-Enter to pending PI | Sean Christopherson
On successful nested VM-Enter, check for pending interrupts and convert the highest priority interrupt to a pending posted interrupt if it matches L2's notification vector. If the vCPU receives a notification interrupt before nested VM-Enter (assuming L1 disables IRQs before doing VM-Enter), the pending interrupt (for L1) should be recognized and processed as a posted interrupt when interrupts become unblocked after VM-Enter to L2. This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is trying to inject an interrupt into L2 by setting the appropriate bit in L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's method of manually moving the vector from PIR->vIRR/RVI). KVM will observe the IPI while the vCPU is in L1 context and so won't immediately morph it to a posted interrupt for L2. The pending interrupt will be seen by vmx_check_nested_events(), cause KVM to force an immediate exit after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit. After handling the VM-Exit, L1 will see that L2 has a pending interrupt in PIR, send another IPI, and repeat until L2 is killed. Note, posted interrupts require virtual interrupt delivery, and virtual interrupt delivery requires exit-on-interrupt, ergo interrupts will be unconditionally unmasked on VM-Enter if posted interrupts are enabled. Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") Cc: stable@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200812175129.12172-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
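The fix plausibly boils down to a check after a successful nested VM-Enter along these lines (a sketch; exact placement and helper names assumed):

  /*
   * If L1's highest-priority pending interrupt is L2's posted-interrupt
   * notification vector, morph it into a pending posted interrupt for L2
   * instead of reflecting it back to L1 as a spurious VM-Exit.
   */
  if (kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
          vmx->nested.pi_pending = true;
          kvm_make_request(KVM_REQ_EVENT, vcpu);
          kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
  }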
2020-09-28 | KVM: VMX: Invoke NMI handler via indirect call instead of INTn | Sean Christopherson
Rework NMI VM-Exit handling to invoke the kernel handler by function call instead of INTn. INTn microcode is relatively expensive, and aligning the IRQ and NMI handling will make it easier to update KVM should some newfangled method for invoking the handlers come along. Suggested-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200915191505.10355-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: VMX: Move IRQ invocation to assembly subroutine | Sean Christopherson
Move the asm blob that invokes the appropriate IRQ handler after VM-Exit into a proper subroutine. Unconditionally create a stack frame in the subroutine so that, as objtool sees things, the function has standard stack behavior. The dynamic stack adjustment makes using unwind hints problematic. Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200915191505.10355-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Add kvm_x86_ops hook to short circuit emulation | Sean Christopherson
Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a more generic is_emulatable(), and unconditionally call the new function in x86_emulate_instruction(). KVM will use the generic hook to support multiple security related technologies that prevent emulation in one way or another. Similar to the existing AMD #NPF case where emulation of the current instruction is not possible due to lack of information, AMD's SEV-ES and Intel's SGX and TDX will introduce scenarios where emulation is impossible due to the guest's register state being inaccessible. And again similar to the existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(), i.e. outside of the control of vendor-specific code. While the cause and architecturally visible behavior of the various cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean resume or complete shutdown, and SEV-ES and TDX "return" an error, the impact on the common emulation code is identical: KVM must stop emulation immediately and resume the guest. Query is_emulatable() in handle_ud() as well so that the force_emulation_prefix code doesn't incorrectly modify RIP before calling emulate_instruction() in the absurdly unlikely scenario that KVM encounters forced emulation in conjunction with "do not emulate". Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it | Krish Sadhukhan
Currently, prepare_vmcs02_early() does not check if the "unrestricted guest" VM-execution control in vmcs12 is turned off and leaves the corresponding bit on in vmcs02. Due to this setting, vmentry checks that are supposed to render the nested guest state invalid when this VM-execution control is not set are passing in hardware. This patch turns off the "unrestricted guest" VM-execution control in vmcs02 if vmcs12 has turned it off. Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Message-Id: <20200921081027.23047-2-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
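The fix in prepare_vmcs02_early() is plausibly a small guard (sketch):

  /* vmcs12 owns this control; don't let the vmcs01 setting leak through. */
  if (exec_control & SECONDARY_EXEC_UNRESTRICTED_GUEST &&
      !nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
          exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;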
2020-09-28 | KVM: X86: Move handling of INVPCID types to x86 | Babu Moger
INVPCID instruction handling is mostly the same across both VMX and SVM, so move the code to the common x86.c. Signed-off-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c | Babu Moger
Handling of kvm_read/write_guest_virt*() errors can be moved to common code. The same code can be used by both VMX and SVM. Signed-off-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
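The resulting common helper is roughly as follows (a sketch assuming the kvm_handle_memory_failure name this commit introduces):

  int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
                                struct x86_exception *e)
  {
          if (r == X86EMUL_PROPAGATE_FAULT) {
                  kvm_inject_emulated_page_fault(vcpu, e);
                  return 1;
          }

          /* Anything else is unexpected; punt to userspace. */
          vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
          vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
          vcpu->run->internal.ndata = 0;

          return 0;
  }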
2020-09-28 | KVM: LAPIC: Reduce world switch latency caused by timer_advance_ns | Wanpeng Li
All the checks in lapic_timer_int_injected(), __kvm_wait_lapic_expire(), and these function calls waste CPU cycles when the timer mode is not tscdeadline. We can observe ~1.3% world switch time overhead by kvm-unit-tests/vmexit.flat vmcall testing on an AMD server. This patch reduces the world switch latency caused by the timer_advance_ns feature when the timer mode is not tscdeadline by simply moving the check against apic->lapic_timer.expired_tscdeadline much earlier. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1599731444-3525-7-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: nVMX: Simplify the initialization of nested_vmx_msrs | Chenyi Qiang
The nested VMX controls MSRs can be initialized by the global capability values stored in vmcs_config. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200828085622.8365-6-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled | Chenyi Qiang
KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS controls, but they are not exposed by the system ioctl KVM_GET_MSR. Add them to the setup of the nested VMX controls MSRs. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20200828085622.8365-2-chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
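In nested_vmx_setup_ctls_msrs() the addition is plausibly just OR-ing in the missing bits (a sketch; the exact hunk may differ):

  /* Advertise the controls KVM already supports for nested guests. */
  msrs->exit_ctls_high  |= VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
                           VM_EXIT_CLEAR_BNDCFGS;
  msrs->entry_ctls_high |= VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
                           VM_ENTRY_LOAD_BNDCFGS;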
2020-09-25 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull more kvm fixes from Paolo Bonzini: "Five small fixes. The nested migration bug will be fixed with a better API in 5.10 or 5.11, for now this is a fix that works with existing userspace but keeps the current ugly API"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Add a dedicated INVD intercept routine
  KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
  KVM: x86: fix MSR_IA32_TSC read for nested migration
  selftests: kvm: Fix assert failure in single-step test
  KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-23 | KVM: x86: VMX: Make smaller physical guest address space support user-configurable | Mohammed Gamal
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter. Since smaller physical address spaces are only supported on VMX, the parameter is only exposed in the kvm_intel module. For now disable support by default, and let the user decide if they want to enable it. Modifications to VMX page fault and EPT violation handling will depend on whether that parameter is enabled. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Message-Id: <20200903141122.72908-1-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
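The knob itself is an ordinary module parameter; a sketch of the described change:

  /* In vmx.c: off by default, opt-in per the rationale above. */
  bool __read_mostly allow_smaller_maxphyaddr;
  module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);

Loading the module with "modprobe kvm_intel allow_smaller_maxphyaddr=1" would then enable the support.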
2020-09-22 | Merge branch 'x86-seves-for-paolo' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD | Paolo Bonzini
2020-09-13 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull kvm fixes from Paolo Bonzini: "A bit on the bigger side, mostly due to me being on vacation, then busy, then on parental leave, but there's nothing worrisome.

ARM:
 - Multiple stolen time fixes, with a new capability to match x86
 - Fix for hugetlbfs mappings when PUD and PMD are the same level
 - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example)
 - Fix tracing output of 64bit values

x86:
 - nSVM state restore fixes
 - Async page fault fixes
 - Lots of small fixes everywhere"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: emulator: more strict rsm checks.
  KVM: nSVM: more strict SMM checks when returning to nested guest
  SVM: nSVM: setup nested msr permission bitmap on nested state load
  SVM: nSVM: correctly restore GIF on vmexit from nesting after migration
  x86/kvm: don't forget to ACK async PF IRQ
  x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macro
  KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit
  KVM: SVM: avoid emulation with stale next_rip
  KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN
  KVM: SVM: Periodically schedule when unregistering regions on destroy
  KVM: MIPS: Change the definition of kvm type
  kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed
  KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control
  KVM: fix memory leak in kvm_io_bus_unregister_dev()
  KVM: Check the allocation of pv cpu mask
  KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
  KVM: arm64: Update page shift if stage 2 block mapping not supported
  KVM: arm64: Fix address truncation in traces
  KVM: arm64: Do not try to map PUDs when they are folded into PMD
  arm64/x86: KVM: Introduce steal-time cap
  ...
2020-09-12 | KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit | Wanpeng Li
According to SDM 27.2.4, event delivery causes an APIC-access VM exit. Don't report an internal error and freeze the guest when event delivery causes an APIC-access exit; it is handleable and the event will be re-injected during the next vmentry. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
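Concretely, this means treating EXIT_REASON_APIC_ACCESS as an acceptable exit while IDT vectoring info is valid; the check in vmx_handle_exit() plausibly becomes (sketch):

  if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
      (exit_reason != EXIT_REASON_EXCEPTION_NMI &&
       exit_reason != EXIT_REASON_EPT_VIOLATION &&
       exit_reason != EXIT_REASON_PML_FULL &&
       exit_reason != EXIT_REASON_APIC_ACCESS &&
       exit_reason != EXIT_REASON_TASK_SWITCH)) {
          /* ...report KVM_INTERNAL_ERROR_DELIVERY_EV to userspace... */
  }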
2020-09-11 | KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control | Chenyi Qiang
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field in exit_ctls_high. Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control") Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11 | KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected | Peter Shier
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call load_pdptrs to read the possibly updated PDPTEs from the guest physical address referenced by CR3. It loads them into vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.

At the subsequent assumed reentry into L2, the mmu will call vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works if the L2 CRn write intercept always resumes L2.

The resume path calls vmx_check_nested_events which checks for exceptions, MTF, and expired VMX preemption timers. If vmx_check_nested_events finds any of these conditions pending it will reflect the corresponding exit into L1. Live migration at this point would also cause a missed immediate reentry into L2.

After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale values stored at last L2 exit. A repro of this bug showed L2 entering triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.

When L2 is in PAE paging mode add a call to ept_load_pdptrs before leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in vcpu->arch.walk_mmu->pdptrs[].

Tested: kvm-unit-tests with new directed test: vmx_mtf_pdpte_test. Verified that test fails without the fix. Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix would repro readily. Ran 14 simultaneous L2s for 140 iterations with no failures.

Signed-off-by: Peter Shier <pshier@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200820230545.2411347-1-pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
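The fix is plausibly a small hunk on the nested VM-Exit path (a sketch; the helper may need to be exported from vmx.c for use by nested.c):

  /*
   * Flush any dirty cached PDPTEs into vmcs02 before leaving L2, so
   * that a later prepare_vmcs02() cannot resurrect stale vmcs12 values.
   */
  if (enable_ept && is_pae_paging(vcpu))
          vmx_ept_load_pdptrs(vcpu);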
2020-09-10 | objtool: Rename frame.h -> objtool.h | Julien Thierry
Header frame.h is getting more code annotations to help objtool analyze object files. Rename the file to objtool.h. [ jpoimboe: add objtool.h to MAINTAINERS ] Signed-off-by: Julien Thierry <jthierry@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2020-08-23 | treewide: Use fallthrough pseudo-keyword | Gustavo A. R. Silva
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through markings that are unnecessary. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-06 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull KVM updates from Paolo Bonzini: "s390:
 - implement diag318

x86:
 - Report last CPU for debugging
 - Emulate smaller MAXPHYADDR in the guest than in the host
 - .noinstr and tracing fixes from Thomas
 - nested SVM page table switching optimization and fixes

Generic:
 - Unify shadow MMU cache data structures across architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  KVM: SVM: Fix sev_pin_memory() error handling
  KVM: LAPIC: Set the TDCR settable bits
  KVM: x86: Specify max TDP level via kvm_configure_mmu()
  KVM: x86/mmu: Rename max_page_level to max_huge_page_level
  KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
  KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
  KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
  KVM: VMX: Make vmx_load_mmu_pgd() static
  KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
  KVM: VMX: Drop a duplicate declaration of construct_eptp()
  KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
  KVM: Using macros instead of magic values
  MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
  KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
  KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
  KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
  KVM: VMX: Add guest physical address check in EPT violation and misconfig
  KVM: VMX: introduce vmx_need_pf_intercept
  KVM: x86: update exception bitmap on CPUID changes
  KVM: x86: rename update_bp_intercept to update_exception_bitmap
  ...
2020-08-04 | Merge tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull x86 fsgsbase from Thomas Gleixner: "Support for FSGSBASE. Almost 5 years after the first RFC to support it, this has been brought into a shape which is maintainable and actually works.

This final version was done by Sasha Levin who took it up after Intel dropped the ball. Sasha discovered that the SGX (sic!) offerings out there ship rogue kernel modules enabling FSGSBASE behind the kernel's back, which opens an instantaneous unprivileged root hole.

The FSGSBASE instructions provide a considerable speedup of the context switch path and enable user space to write GSBASE without kernel interaction. This enablement requires careful handling of the exception entries which go through the paranoid entry path as they can no longer rely on the assumption that user GSBASE is positive (as enforced via prctl() on non-FSGSBASE-enabled systems). All other entries (syscalls, interrupts and exceptions) can still just utilize SWAPGS unconditionally when the entry comes from user space. Converting these entries to use FSGSBASE has no benefit as SWAPGS is only marginally slower than WRGSBASE and locating and retrieving the kernel GSBASE value is not a free operation either. The real benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes.

The changes come with appropriate selftests and have held up in field testing against the (sanitized) Graphene-SGX driver"

* tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86/fsgsbase: Fix Xen PV support
  x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase
  selftests/x86/fsgsbase: Add a missing memory constraint
  selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test
  selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE
  selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE
  selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write
  Documentation/x86/64: Add documentation for GS/FS addressing mode
  x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
  x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
  x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
  x86/entry/64: Introduce the FIND_PERCPU_BASE macro
  x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
  x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
  x86/process/64: Use FSGSBASE instructions on thread copy and ptrace
  x86/process/64: Use FSBSBASE in switch_to() if available
  x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE
  x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
  x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
  x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
  ...
2020-08-04 | Merge tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull x86 conversion to generic entry code from Thomas Gleixner: "The conversion of X86 syscall, interrupt and exception entry/exit handling to the generic code. Pretty much a straight-forward 1:1 conversion plus the consolidation of the KVM handling of pending work before entering guest mode"

* tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu()
  x86/kvm: Use generic xfer to guest work function
  x86/entry: Cleanup idtentry_enter/exit
  x86/entry: Use generic interrupt entry/exit code
  x86/entry: Cleanup idtentry_entry/exit_user
  x86/entry: Use generic syscall exit functionality
  x86/entry: Use generic syscall entry function
  x86/ptrace: Provide pt_regs helper for entry/exit
  x86/entry: Move user return notifier out of loop
  x86/entry: Consolidate 32/64 bit syscall entry
  x86/entry: Consolidate check_user_regs()
  x86: Correct noinstr qualifiers
  x86/idtentry: Remove stale comment
2020-07-30 | KVM: x86: Specify max TDP level via kvm_configure_mmu() | Sean Christopherson
Capture the max TDP level during kvm_configure_mmu() instead of using a kvm_x86_ops hook to do it at every vCPU creation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-30 | KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR | Sean Christopherson
Calculate the desired TDP level on the fly using the max TDP level and MAXPHYADDR instead of doing the same when CPUID is updated. This avoids the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior across x86. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
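The on-the-fly calculation is small; a sketch matching the stated rule:

  static int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
  {
          /* Use 5-level TDP if and only if MAXPHYADDR requires it. */
          if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
                  return 4;

          return max_tdp_level;
  }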
2020-07-30 | KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch | Sean Christopherson
Remove the WARN in vmx_load_mmu_pgd() that was temporarily added to aid bisection/debug in the event the current MMU's shadow root level didn't match VMX's computed EPTP level. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-30 | KVM: x86: Pull the PGD's level from the MMU instead of recalculating it | Sean Christopherson
Use the shadow_root_level from the current MMU as the root level for the PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between VMX and the MMU where both must independently calculate the same root level for things to work correctly. Temporarily keep VMX's calculation of the level and use it to WARN if the incoming level diverges. Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell, and rename a 'cr3' param in the load_mmu_pgd prototype that managed to survive the cr3 purge. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-30 | KVM: VMX: Make vmx_load_mmu_pgd() static | Sean Christopherson
Make vmx_load_mmu_pgd() static as it is no longer invoked directly by nested VMX (or any code for that matter). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-30 | KVM: VMX: Drop a duplicate declaration of construct_eptp() | Sean Christopherson
Remove an extra declaration of construct_eptp() from vmx.h. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-27 | KVM: nVMX: check for invalid hdr.vmx.flags | Paolo Bonzini
hdr.vmx.flags is meant for future extensions to the ABI, rejecting invalid flags is necessary to avoid broken half-loads of the nVMX state. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-27 | KVM: nVMX: check for required but missing VMCS12 in KVM_SET_NESTED_STATE | Paolo Bonzini
A missing VMCS12 was not causing -EINVAL (it was just read with copy_from_user, so it is not a security issue, but it is still wrong). Test for VMCS12 validity and reject the nested state if a VMCS12 is required but not present. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-24 | x86/kvm: Use generic xfer to guest work function | Thomas Gleixner
Use the generic infrastructure to check for and handle pending work before transitioning into guest mode. This now handles TIF_NOTIFY_RESUME as well which was ignored so far. Handling it is important as this covers task work and task work will be used to offload the heavy lifting of POSIX CPU timers to thread context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.979724969@linutronix.de
2020-07-10 | KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support | Mohammed Gamal
This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X).

The complications in this patch are due to unexpected (but documented) behaviour we see with NPF vmexit handling in AMD processors. If SVM is modified to add guest physical address checks in the NPF and guest #PF paths, we see the following error multiple times in the 'access' test in kvm-unit-tests:

  test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
  Dump mapping: address: 0x123400000000
  ------L4: 24c3027
  ------L3: 24c4027
  ------L2: 24c5021
  ------L1: 1002000021

This is because the PTE's accessed bit is set by the CPU hardware before the NPF vmexit. This is handled completely by hardware and cannot be fixed in software. Therefore, availability of the new capability depends on a boolean variable allow_smaller_maxphyaddr which is set individually by VMX and SVM init routines. On VMX it's always set to true; on SVM it's only set to true when NPT is not enabled.

CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Babu Moger <babu.moger@amd.com> Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Message-Id: <20200710154811.418214-10-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 | KVM: VMX: optimize #PF injection when MAXPHYADDR does not match | Paolo Bonzini
Ignore non-present page faults, since those cannot have reserved bits set. When running access.flat with "-cpu Haswell,phys-bits=36", the number of trapped page faults goes down from 8872644 to 3978948. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-9-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
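The mechanism is the VMCS page-fault error-code mask/match filter: only faults with the P bit set can need the reserved-bit fixup, so non-present faults need not exit. A sketch of the idea, not the exact hunk:

  /*
   * Only trap page faults that are present (P=1) and not already
   * reserved-bit (RSVD=1) faults; everything else is delivered to
   * the guest without a VM exit.
   */
  vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
               PFERR_PRESENT_MASK | PFERR_RSVD_MASK);
  vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);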
2020-07-10 | KVM: VMX: Add guest physical address check in EPT violation and misconfig | Mohammed Gamal
Check guest physical address against its maximum, which depends on the guest MAXPHYADDR. If the guest's physical address exceeds the maximum (i.e. has reserved bits set), inject a guest page fault with PFERR_RSVD_MASK set. This has to be done both in the EPT violation and page fault paths, as there are complications in both cases with respect to the computation of the correct error code. For EPT violations, unfortunately the only possibility is to emulate, because the access type in the exit qualification might refer to an access to a paging structure, rather than to the access performed by the program. Trapping page faults instead is needed in order to correct the error code, but the access type can be obtained from the original error code and passed to gva_to_gpa. The corrections required in the error code are subtle. For example, imagine that a PTE for a supervisor page has a reserved bit set. On a supervisor-mode access, the EPT violation path would trigger. However, on a user-mode access, the processor will not notice the reserved bit and not include PFERR_RSVD_MASK in the error code. Co-developed-by: Mohammed Gamal <mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-8-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
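In the EPT-violation path the check is plausibly as follows (a sketch; at this point in the history the helper still carries its original kvm_mmu_is_illegal_gpa name):

  /*
   * A GPA above the guest's MAXPHYADDR is a guest page fault, but the
   * exit qualification may describe an access to a paging structure
   * rather than the program's own access, so the only safe option is
   * to emulate the faulting instruction.
   */
  if (unlikely(allow_smaller_maxphyaddr && kvm_mmu_is_illegal_gpa(vcpu, gpa)))
          return kvm_emulate_instruction(vcpu, 0);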
2020-07-10 | KVM: VMX: introduce vmx_need_pf_intercept | Paolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-7-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
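The commit message is terse, but the helper itself is plausibly a one-liner (sketch):

  static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
  {
          /* #PF must be intercepted to emulate a smaller guest MAXPHYADDR. */
          return !enable_ept ||
                 cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
  }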
2020-07-10 | KVM: x86: rename update_bp_intercept to update_exception_bitmap | Paolo Bonzini
We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 | KVM: x86: move MSR_IA32_PERF_CAPABILITIES emulation to common x86 code | Vitaly Kuznetsov
state_test/smm_test selftests are failing on AMD with: "Unexpected result from KVM_GET_MSRS, r: 51 (failed MSR was 0x345)" MSR_IA32_PERF_CAPABILITIES is an emulated MSR on Intel but it is not known to AMD code, we can move the emulation to common x86 code. For AMD, we basically just allow the host to read and write zero to the MSR. Fixes: 27461da31089 ("KVM: x86/pmu: Support full width counting") Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710152559.1645827-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 | KVM: nVMX: fixes for preemption timer migration | Paolo Bonzini
Commit 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration", 2020-06-01) accidentally broke nVMX live migration from older version by changing the userspace ABI. Restore it and, while at it, ensure that vmx->nested.has_preemption_timer_deadline is always initialized according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag. Cc: Makarand Sonare <makarandsonare@google.com> Fixes: 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration") Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-09 | x86/kvm/vmx: Use native read/write_cr2() | Thomas Gleixner
read/write_cr2() go through the paravirt XXL indirection, but nested VMX in a XEN_PV guest is not supported. Use the native variants. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Message-Id: <20200708195322.344731916@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
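The CR2 handling around VM-Enter then plausibly reads (sketch):

  /* Bypass the pv_ops XXL indirection; Xen PV never runs nested VMX. */
  if (vcpu->arch.cr2 != native_read_cr2())
          native_write_cr2(vcpu->arch.cr2);

  /* ... VMLAUNCH/VMRESUME happens here ... */

  vcpu->arch.cr2 = native_read_cr2();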
2020-07-09 | x86/kvm/vmx: Move guest enter/exit into .noinstr.text | Thomas Gleixner
Move the functions which are inside the RCU off region into the non-instrumentable text section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195322.037311579@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-09 | x86/kvm/vmx: Add hardirq tracing to guest enter/exit | Thomas Gleixner
Entering guest mode is more or less the same as returning to user space. From an instrumentation point of view both leave kernel mode and the transition to guest or user mode reenables interrupts on the host. In user mode an interrupt is served directly and in guest mode it causes a VM exit which then handles or reinjects the interrupt. The transition from guest mode or user mode to kernel mode disables interrupts, which needs to be recorded in instrumentation to set the correct state again. This is important for e.g. latency analysis because otherwise the execution time in guest or user mode would be wrongly accounted as interrupt disabled and could trigger false positives. Add hardirq tracing to guest enter/exit functions in the same way as it is done in the user mode enter/exit code, respecting the RCU requirements. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195321.822002354@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-09 | x86/kvm: Move context tracking where it belongs | Thomas Gleixner
Context tracking for KVM happens way too early in the vcpu_run() code. Anything after guest_enter_irqoff() and before guest_exit_irqoff() cannot use RCU and should also be not instrumented. The current way of doing this covers way too much code. Move it closer to the actual vmenter/exit code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195321.724574345@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-09 | kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host | Maxim Levitsky
To avoid complex and in some cases incorrect logic in kvm_spec_ctrl_test_value, just try the guest's given value on the host processor instead, and if it doesn't #GP, allow the guest to set it. One such case is when the host CPU supports the STIBP mitigation but doesn't support IBRS (as is the case with some Zen2 AMD CPUs); here we were giving the guest #GP when it tried to use STIBP.

The reason why we can do the host test is that the IA32_SPEC_CTRL MSR is passed through to the guest after the guest sets it to a non-zero value for the first time (due to performance reasons). As a result, it is pointless to emulate the #GP condition on this first access in a different way than what the host CPU does.

This is based on a patch from Sean Christopherson, who suggested this idea.

Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200708115731.180097-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
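The runtime probe looks roughly like this (a sketch of the described approach):

  int kvm_spec_ctrl_test_value(u64 value)
  {
          u64 saved_value;
          unsigned long flags;
          int ret = 0;

          local_irq_save(flags);

          /* Probe the value on the host: if it doesn't #GP, allow it. */
          if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved_value))
                  ret = 1;
          else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
                  ret = 1;
          else
                  wrmsrl(MSR_IA32_SPEC_CTRL, saved_value);

          local_irq_restore(flags);

          return ret;
  }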