summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-03-16KVM: x86: Drop redundant boot cpu checks on SSBD feature bitsSean Christopherson
Drop redundant checks when "emulating" SSBD feature across vendors, i.e. advertising the AMD variant when running on an Intel CPU and vice versa. Both SPEC_CTRL_SSBD and AMD_SSBD are already defined in the leaf-specific feature masks and are *not* forcefully set by the kernel, i.e. will already be set in the entry when supported by the host. Functionally, this changes nothing, but the redundant check is confusing, especially when considering future patches that will further differentiate between "real" and "emulated" feature bits. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Drop the explicit @index from do_cpuid_7_mask()Sean Christopherson
Drop the index param from do_cpuid_7_mask() and instead switch on the entry's index, which is guaranteed to be set by do_host_cpuid(). No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Clean up CPUID 0x7 sub-leaf loopSean Christopherson
Refactor the sub-leaf loop for CPUID 0x7 to move the main leaf out of said loop. The emitted code savings is basically a mirage, as the handling of the main leaf can easily be split to its own helper to avoid code bloat. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Refactor CPUID 0xD.N sub-leaf entry creationSean Christopherson
Increment the number of CPUID entries immediately after do_host_cpuid() in preparation for moving the logic into do_host_cpuid(). Handle the rare/impossible case of encountering a bogus sub-leaf by decrementing the number entries on failure. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Warn on zero-size save state for valid CPUID 0xD.N sub-leafSean Christopherson
WARN if the save state size for a valid XCR0-managed sub-leaf is zero, which would indicate a KVM or CPU bug. Add a comment to explain why KVM WARNs so the reader doesn't have to tease out the relevant bits from Intel's SDM and KVM's XCR0/XSS code. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Check for CPUID 0xD.N support before validating array sizeSean Christopherson
Now that sub-leaf 1 is handled separately, verify the next sub-leaf is needed before rejecting KVM_GET_SUPPORTED_CPUID due to an insufficiently sized userspace array. Note, although this is technically a bug, it's not visible to userspace as KVM_GET_SUPPORTED_CPUID is guaranteed to fail on KVM_CPUID_SIGNATURE, which is hardcoded to be added after leaf 0xD. The real motivation for the change is to tightly couple the nent/maxnent and do_host_cpuid() sequences in preparation for future cleanup. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Move CPUID 0xD.1 handling out of the index>0 loopSean Christopherson
Mov the sub-leaf 1 handling for CPUID 0xD out of the index>0 loop so that the loop only handles index>2. Sub-leafs 2+ have identical semantics, whereas sub-leaf 1 is effectively a feature sub-leaf. Moving sub-leaf 1 out of the loop does duplicate a bit of code, but the nent/maxnent code will be consolidated in a future patch, and duplicating the clear of ECX/EDX is arguably a good thing as the reasons for clearing said registers are completely different. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Check userspace CPUID array size after validating sub-leafSean Christopherson
Verify that the next sub-leaf of CPUID 0x4 (or 0x8000001d) is valid before rejecting the entire KVM_GET_SUPPORTED_CPUID due to insufficent space in the userspace array. Note, although this is technically a bug, it's not visible to userspace as KVM_GET_SUPPORTED_CPUID is guaranteed to fail on KVM_CPUID_SIGNATURE, which is hardcoded to be added after the affected leafs. The real motivation for the change is to tightly couple the nent/maxnent and do_host_cpuid() sequences in preparation for future cleanup. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Clean up error handling in kvm_dev_ioctl_get_cpuid()Sean Christopherson
Clean up the error handling in kvm_dev_ioctl_get_cpuid(), which has gotten a bit crusty as the function has evolved over the years. Opportunistically hoist the static @funcs declaration to the top of the function to make it more obvious that it's a "static const". No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Simplify handling of Centaur CPUID leafsSean Christopherson
Refactor the handling of the Centaur-only CPUID leaf to detect the leaf via a runtime query instead of adding a one-off callback in the static array. When the callback was introduced, there were additional fields in the array's structs, and more importantly, retpoline wasn't a thing. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Refactor loop around do_cpuid_func() to separate helperSean Christopherson
Move the guts of kvm_dev_ioctl_get_cpuid()'s CPUID func loop to a separate helper to improve code readability and pave the way for future cleanup. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Return -E2BIG when KVM_GET_SUPPORTED_CPUID hits max entriesSean Christopherson
Fix a long-standing bug that causes KVM to return 0 instead of -E2BIG when userspace's array is insufficiently sized. This technically breaks backwards compatibility, e.g. a userspace with a hardcoded cpuid->nent could theoretically be broken as it would see an error instead of success if cpuid->nent is less than the number of entries required to fully enumerate the host CPU. But, the lowest known cpuid->nent hardcoded by a VMM is 100 (lkvm and selftests), and the limit for current processors on Intel and AMD is well under a 100. E.g. Intel's Icelake server with all the bells and whistles tops out at ~60 entries (variable due to SGX sub-leafs), and AMD's CPUID documentation allows for less than 50. CPUID 0xD sub-leaves on current kernels are capped by the value of KVM_SUPPORTED_XCR0, and therefore so many subleaves cannot have appeared on current kernels. Note, while the Fixes: tag is accurate with respect to the immediate bug, it's likely that similar bugs in KVM_GET_SUPPORTED_CPUID existed prior to the refactoring, e.g. Qemu contains a workaround for the broken KVM_GET_SUPPORTED_CPUID behavior that predates the buggy commit by over two years. The Qemu workaround is also likely the main reason the bug has gone unreported for so long. Qemu hack: commit 76ae317f7c16aec6b469604b1764094870a75470 Author: Mark McLoughlin <markmc@redhat.com> Date: Tue May 19 18:55:21 2009 +0100 kvm: work around supported cpuid ioctl() brokenness KVM_GET_SUPPORTED_CPUID has been known to fail to return -E2BIG when it runs out of entries. Detect this by always trying again with a bigger table if the ioctl() fills the table. Fixes: 831bf664e9c1f ("KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid") Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Shrink the usercopy region of the emulation contextSean Christopherson
Shuffle a few operand structs to the end of struct x86_emulate_ctxt and update the cache creation to whitelist only the region of the emulation context that is expected to be copied to/from user memory, e.g. the instruction operands, registers, and fetch/io/mem caches. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Move kvm_emulate.h into KVM's private directorySean Christopherson
Now that the emulation context is dynamically allocated and not embedded in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public asm directory and into KVM's private x86 directory. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Dynamically allocate per-vCPU emulation contextSean Christopherson
Allocate the emulation context instead of embedding it in struct kvm_vcpu_arch. Dynamic allocation provides several benefits: - Shrinks the size x86 vcpus by ~2.5k bytes, dropping them back below the PAGE_ALLOC_COSTLY_ORDER threshold. - Allows for dropping the include of kvm_emulate.h from asm/kvm_host.h and moving kvm_emulate.h into KVM's private directory. - Allows a reducing KVM's attack surface by shrinking the amount of vCPU data that is exposed to usercopy. - Allows a future patch to disable the emulator entirely, which may or may not be a realistic endeavor. Mark the entire struct as valid for usercopy to maintain existing behavior with respect to hardened usercopy. Future patches can shrink the usercopy range to cover only what is necessary. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Move emulation-only helpers to emulate.cSean Christopherson
Move ctxt_virt_addr_bits() and emul_is_noncanonical_address() from x86.h to emulate.c. This eliminates all references to struct x86_emulate_ctxt from x86.h, and sets the stage for a future patch to stop including kvm_emulate.h in asm/kvm_host.h. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Explicitly pass an exception struct to check_interceptSean Christopherson
Explicitly pass an exception struct when checking for intercept from the emulator, which eliminates the last reference to arch.emulate_ctxt in vendor specific code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Refactor I/O emulation helpers to provide vcpu-only variantSean Christopherson
Add variants of the I/O helpers that take a vCPU instead of an emulation context. This will eventually allow KVM to limit use of the emulation context to the full emulation path. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Fix warning due to implicit truncation on 32-bit KVMSean Christopherson
Explicitly cast the integer literal to an unsigned long when stuffing a non-canonical value into the host virtual address during private memslot deletion. The explicit cast fixes a warning that gets promoted to an error when running with KVM's newfangled -Werror setting. arch/x86/kvm/x86.c:9739:9: error: large integer implicitly truncated to unsigned type [-Werror=overflow] Fixes: a3e967c0b87d3 ("KVM: Terminate memslot walks via used_slots" Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: nVMX: Drop unnecessary check on ept caps for execute-onlySean Christopherson
Drop the call to cpu_has_vmx_ept_execute_only() when calculating which EPT capabilities will be exposed to L1 for nested EPT. The resulting configuration is immediately sanitized by the passed in @ept_caps, and except for the call from vmx_check_processor_compat(), @ept_caps is the capabilities that are queried by cpu_has_vmx_ept_execute_only(). For vmx_check_processor_compat(), KVM *wants* to ignore vmx_capability.ept so that a divergence in EPT capabilities between CPUs is detected. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd()Sean Christopherson
Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to note that it will return something other than CR3 when nested EPT is in use. Hopefully the new name will also make it more obvious that L1's nested_cr3 is returned in SVM's nested NPT case. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: nVMX: Rename EPTP validity helper and associated variablesSean Christopherson
Rename valid_ept_address() to nested_vmx_check_eptp() to follow the nVMX nomenclature and to reflect that the function now checks a lot more than just the address contained in the EPTP. Rename address to new_eptp in associated code. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: nVMX: Rename nested_ept_get_cr3() to nested_ept_get_eptp()Sean Christopherson
Rename the accessor for vmcs12.EPTP to use "eptp" instead of "cr3". The accessor has no relation to cr3 whatsoever, other than it being assigned to the also poorly named kvm_mmu->get_cr3() hook. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: nVMX: Allow L1 to use 5-level page walks for nested EPTSean Christopherson
Add support for 5-level nested EPT, and advertise said support in the EPT capabilities MSR. KVM's MMU can already handle 5-level legacy page tables, there's no reason to force an L1 VMM to use shadow paging if it wants to employ 5-level page tables. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hackSean Christopherson
Drop kvm_mmu_extended_role.cr4_la57 now that mmu_role doesn't mask off level, which already incorporates the guest's CR4.LA57 for a shadow MMU by querying is_la57_mode(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Don't drop level/direct from MMU role calculationSean Christopherson
Use the calculated role as-is when propagating it to kvm_mmu.mmu_role, i.e. stop masking off meaningful fields. The concept of masking off fields came from kvm_mmu_pte_write(), which (correctly) ignores certain fields when comparing kvm_mmu_page.role against kvm_mmu.mmu_role, e.g. the current mmu's access and level have no relation to a shadow page's access and level. Masking off the level causes problems for 5-level paging, e.g. CR4.LA57 has its own redundant flag in the extended role, and nested EPT would need a similar hack to support 5-level paging for L2. Opportunistically rework the mask for kvm_mmu_pte_write() to define the fields that should be ignored as opposed to the fields that should be checked, i.e. make it opt-out instead of opt-in so that new fields are automatically picked up. While doing so, stop ignoring "direct". The field is effectively ignored anyways because kvm_mmu_pte_write() is only reached with an indirect mmu and the loop only walks indirect shadow pages, but double checking "direct" literally costs nothing. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: nVMX: Properly handle userspace interrupt window requestSean Christopherson
Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has external interrupt exiting enabled. IRQs are never blocked in hardware if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger VM-Exit. The new check percolates up to kvm_vcpu_ready_for_interrupt_injection() and thus vcpu_run(), and so KVM will exit to userspace if userspace has requested an interrupt window (to inject an IRQ into L1). Remove the @external_intr param from vmx_check_nested_events(), which is actually an indicator that userspace wants an interrupt window, e.g. it's named @req_int_win further up the stack. Injecting a VM-Exit into L1 to try and bounce out to L0 userspace is all kinds of broken and is no longer necessary. Remove the hack in nested_vmx_vmexit() that attempted to workaround the breakage in vmx_check_nested_events() by only filling interrupt info if there's an actual interrupt pending. The hack actually made things worse because it caused KVM to _never_ fill interrupt info when the LAPIC resides in userspace (kvm_cpu_has_interrupt() queries interrupt.injected, which is always cleared by prepare_vmcs12() before reaching the hack in nested_vmx_vmexit()). Fixes: 6550c4df7e50 ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"") Cc: stable@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: X86: trigger kvmclock sync request just once on VM creationWanpeng Li
In the progress of vCPUs creation, it queues a kvmclock sync worker to the global workqueue before each vCPU creation completes. The workqueue subsystem guarantees not to queue the already queued work; however, we can make the logic more clear by making just one leader to trigger this kvmclock sync request, and also save on cacheline bouncing caused by test_and_set_bit. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: LAPIC: Recalculate apic map in batchWanpeng Li
In the vCPU reset and set APIC_BASE MSR path, the apic map will be recalculated several times, each time it will consume 10+ us observed by ftrace in my non-overcommit environment since the expensive memory allocate/mutex/rcu etc operations. This patch optimizes it by recaluating apic map in batch, I hope this can benefit the serverless scenario which can frequently create/destroy VMs. Before patch: kvm_lapic_reset ~27us After patch: kvm_lapic_reset ~14us Observed by ftrace, improve ~48%. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Fix some obsolete commentsMiaohe Lin
Remove some obsolete comments, fix wrong function name and description. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: enable dirty log gradually in small chunksJay Zhou
It could take kvm->mmu_lock for an extended period of time when enabling dirty log for the first time. The main cost is to clear all the D-bits of last level SPTEs. This situation can benefit from manual dirty log protect as well, which can reduce the mmu_lock time taken. The sequence is like this: 1. Initialize all the bits of the dirty bitmap to 1 when enabling dirty log for the first time 2. Only write protect the huge pages 3. KVM_GET_DIRTY_LOG returns the dirty bitmap info 4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level SPTEs gradually in small chunks Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment, I did some tests with a 128G windows VM and counted the time taken of memory_global_dirty_log_start, here is the numbers: VM Size Before After optimization 128G 460ms 10ms Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Reuse the current root if possible for fast switchSean Christopherson
Reuse the current root when possible instead of grabbing a different root from the array of cached roots. Doing so avoids unnecessary MMU switches and also fixes a quirk where KVM can't reuse roots without creating multiple roots since the cache is a victim cache, i.e. roots are added to the cache when they're "evicted", not when they are created. The quirk could be fixed by adding roots to the cache on creation, but that would reduce the effective size of the cache as one of its entries would be burned to track the current root. Reusing the current root is especially helpful for nested virt as the current root is almost always usable for the "new" MMU on nested VM-entry/VM-exit. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Ignore guest CR3 on fast root switch for direct MMUSean Christopherson
Ignore the guest's CR3 when looking for a cached root for a direct MMU, the guest's CR3 has no impact on the direct MMU's shadow pages (the role check ensures compatibility with CR0.WP, etc...). Zero out root_cr3 when allocating the direct roots to make it clear that it's ignored. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: SVM: Inhibit APIC virtualization for X2APIC guestOliver Upton
The AVIC does not support guest use of the x2APIC interface. Currently, KVM simply chooses to squash the x2APIC feature in the guest's CPUID If the AVIC is enabled. Doing so prevents KVM from running a guest with greater than 255 vCPUs, as such a guest necessitates the use of the x2APIC interface. Instead, inhibit AVIC enablement on a per-VM basis whenever the x2APIC feature is set in the guest's CPUID. Signed-off-by: Oliver Upton <oupton@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Remove unnecessary asm/kvm_host.h includesPeter Xu
Remove includes of asm/kvm_host.h from files that already include linux/kvm_host.h to make it more obvious that there is no ordering issue between the two headers. linux/kvm_host.h includes asm/kvm_host.h to pick up architecture specific settings, and this will never change, i.e. including asm/kvm_host.h after linux/kvm_host.h may seem problematic, but in practice is simply redundant. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Consolidate VM allocation and free for VMX and SVMSean Christopherson
Move the VM allocation and free code to common x86 as the logic is more or less identical across SVM and VMX. Note, although hyperv.hv_pa_pg is part of the common kvm->arch, it's (currently) only allocated by VMX VMs. But, since kfree() plays nice when passed a NULL pointer, the superfluous call for SVM is harmless and avoids future churn if SVM gains support for HyperV's direct TLB flush. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> [Make vm_size a field instead of a function. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Directly return __vmalloc() result in ->vm_alloc()Sean Christopherson
Directly return the __vmalloc() result in {svm,vmx}_vm_alloc() to pave the way for handling VM alloc/free in common x86 code, and to obviate the need to check the result of __vmalloc() in vendor specific code. Add a build-time assertion to ensure each structs' "kvm" field stays at offset 0, which allows interpreting a "struct kvm_{svm,vmx}" as a "struct kvm". Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Gracefully handle __vmalloc() failure during VM allocationSean Christopherson
Check the result of __vmalloc() to avoid dereferencing a NULL pointer in the event that allocation failres. Fixes: d1e5b0e98ea27 ("kvm: Make VM ioctl do valloc for some archs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Adjust counter sample period after a wrmsrEric Hankland
The sample_period of a counter tracks when that counter will overflow and set global status/trigger a PMI. However this currently only gets set when the initial counter is created or when a counter is resumed; this updates the sample period after a wrmsr so running counters will accurately reflect their new value. Signed-off-by: Eric Hankland <ehankland@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Consolidate open coded variants of memslot TLB flushesSean Christopherson
Replace open coded instances of kvm_arch_flush_remote_tlbs_memslot()'s functionality with calls to the aforementioned function. Update the comment in kvm_arch_flush_remote_tlbs_memslot() to elaborate on how it is used and why it asserts that slots_lock is held. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Use range-based TLB flush for dirty log memslot flushSean Christopherson
Use the with_address() variant when performing a TLB flush for a specific memslot via kvm_arch_flush_remote_tlbs_memslot(), i.e. when flushing after clearing dirty bits during KVM_{GET,CLEAR}_DIRTY_LOG. This aligns all dirty log memslot-specific TLB flushes to use the with_address() variant and paves the way for consolidating the relevant code. Note, moving to the with_address() variant only affects functionality when running as a HyperV guest. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86/mmu: Move kvm_arch_flush_remote_tlbs_memslot() to mmu.cSean Christopherson
Move kvm_arch_flush_remote_tlbs_memslot() from x86.c to mmu.c in preparation for calling kvm_flush_remote_tlbs_with_address() instead of kvm_flush_remote_tlbs(). The with_address() variant is statically defined in mmu.c, arguably kvm_arch_flush_remote_tlbs_memslot() belongs in mmu.c anyways, and defining kvm_arch_flush_remote_tlbs_memslot() in mmu.c will allow the compiler to inline said function when a future patch consolidates open coded variants of the function. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Terminate memslot walks via used_slotsSean Christopherson
Refactor memslot handling to treat the number of used slots as the de facto size of the memslot array, e.g. return NULL from id_to_memslot() when an invalid index is provided instead of relying on npages==0 to detect an invalid memslot. Rework the sorting and walking of memslots in advance of dynamically sizing memslots to aid bisection and debug, e.g. with luck, a bug in the refactoring will bisect here and/or hit a WARN instead of randomly corrupting memory. Alternatively, a global null/invalid memslot could be returned, i.e. so callers of id_to_memslot() don't have to explicitly check for a NULL memslot, but that approach runs the risk of introducing difficult-to- debug issues, e.g. if the global null slot is modified. Constifying the return from id_to_memslot() to combat such issues is possible, but would require a massive refactoring of arch specific code and would still be susceptible to casting shenanigans. Add function comments to update_memslots() and search_memslots() to explicitly (and loudly) state how memslots are sorted. Opportunistically stuff @hva with a non-canonical value when deleting a private memslot on x86 to detect bogus usage of the freed slot. No functional change intended. Tested-by: Christoffer Dall <christoffer.dall@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Provide common implementation for generic dirty log functionsSean Christopherson
Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code. The arch specific implemenations are extremely similar, differing only in whether the dirty log needs to be sync'd from hardware (x86) and how the TLBs are flushed. Add new arch hooks to handle sync and TLB flush; the sync will also be used for non-generic dirty log support in a future patch (s390). The ulterior motive for providing a common implementation is to eliminate the dependency between arch and common code with respect to the memslot referenced by the dirty log, i.e. to make it obvious in the code that the validity of the memslot is guaranteed, as a future patch will rework memslot handling such that id_to_memslot() can return NULL. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Simplify kvm_free_memslot() and all its descendentsSean Christopherson
Now that all callers of kvm_free_memslot() pass NULL for @dont, remove the param from the top-level routine and all arch's implementations. No functional change intended. Tested-by: Christoffer Dall <christoffer.dall@arm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Free arrays for old memslot when moving memslot's base gfnSean Christopherson
Explicitly free the metadata arrays (stored in slot->arch) in the old memslot structure when moving the memslot's base gfn is committed. This eliminates x86's dependency on kvm_free_memslot() being called when a memslot move is committed, and paves the way for removing the funky code in kvm_free_memslot() that conditionally frees structures based on its @dont param. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Drop "const" attribute from old memslot in commit_memory_region()Sean Christopherson
Drop the "const" attribute from @old in kvm_arch_commit_memory_region() to allow arch specific code to free arch specific resources in the old memslot without having to cast away the attribute. Freeing resources in kvm_arch_commit_memory_region() paves the way for simplifying kvm_free_memslot() by eliminating the last usage of its @dont param. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: Drop kvm_arch_create_memslot()Sean Christopherson
Remove kvm_arch_create_memslot() now that all arch implementations are effectively nops. Removing kvm_arch_create_memslot() eliminates the possibility for arch specific code to allocate memory prior to setting a memslot, which sets the stage for simplifying kvm_free_memslot(). Cc: Janosch Frank <frankja@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Allocate memslot resources during prepare_memory_region()Sean Christopherson
Allocate the various metadata structures associated with a new memslot during kvm_arch_prepare_memory_region(), which paves the way for removing kvm_arch_create_memslot() altogether. Moving x86's memory allocation only changes the order of kernel memory allocations between x86 and common KVM code. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16KVM: x86: Allocate new rmap and large page tracking when moving memslotSean Christopherson
Reallocate a rmap array and recalcuate large page compatibility when moving an existing memslot to correctly handle the alignment properties of the new memslot. The number of rmap entries required at each level is dependent on the alignment of the memslot's base gfn with respect to that level, e.g. moving a large-page aligned memslot so that it becomes unaligned will increase the number of rmap entries needed at the now unaligned level. Not updating the rmap array is the most obvious bug, as KVM accesses garbage data beyond the end of the rmap. KVM interprets the bad data as pointers, leading to non-canonical #GPs, unexpected #PFs, etc... general protection fault: 0000 [#1] SMP CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:rmap_get_first+0x37/0x50 [kvm] Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3 RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246 RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012 RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570 RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8 R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000 R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8 FS: 00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0 Call Trace: kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm] __kvm_set_memory_region.part.64+0x559/0x960 [kvm] kvm_set_memory_region+0x45/0x60 [kvm] kvm_vm_ioctl+0x30f/0x920 [kvm] do_vfs_ioctl+0xa1/0x620 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4c/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f7ed9911f47 Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47 RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004 RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700 R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000 R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000 Modules linked in: kvm_intel kvm irqbypass ---[ end trace 0c5f570b3358ca89 ]--- The disallow_lpage tracking is more subtle. Failure to update results in KVM creating large pages when it shouldn't, either due to stale data or again due to indexing beyond the end of the metadata arrays, which can lead to memory corruption and/or leaking data to guest/userspace. Note, the arrays for the old memslot are freed by the unconditional call to kvm_free_memslot() in __kvm_set_memory_region(). Fixes: 05da45583de9b ("KVM: MMU: large page support") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>