path: root/arch/x86/kvm
2023-02-03  KVM: x86: Explicitly state lockdep condition of msr_filter update  (Michal Luczaj)
Replace `1` with the actual mutex_is_locked() check. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-5-mhal@rbox.co [sean: delete the comment that explained the hardcoded '1'] Signed-off-by: Sean Christopherson <seanjc@google.com>
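A minimal sketch of the change, assuming the rcu_replace_pointer() form used elsewhere in this series:

  /* Was: the lockdep condition was a hardcoded '1'. */
  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter, 1);

  /* Now: spell out the actual locking requirement so lockdep can verify it. */
  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
                                   mutex_is_locked(&kvm->lock));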
2023-02-03  KVM: x86: Simplify msr_filter update  (Michal Luczaj)
Replace srcu_dereference()+rcu_assign_pointer() sequence with a single rcu_replace_pointer(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-4-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
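Roughly, the two-step read-then-publish sequence collapses into one call (a sketch, not the literal diff):

  /* Before: separate dereference and assignment. */
  old_filter = srcu_dereference(kvm->arch.msr_filter, &kvm->srcu);
  rcu_assign_pointer(kvm->arch.msr_filter, new_filter);

  /* After: swap the pointer and get the old value back in one call. */
  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
                                   mutex_is_locked(&kvm->lock));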
2023-02-03  KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_X86_SET_MSR_FILTER)  (Michal Luczaj)
Reduce time spent holding kvm->lock: unlock the mutex before calling synchronize_srcu(). There is no need to hold kvm->lock until all vCPUs have been kicked; KVM only needs to guarantee that all vCPUs will switch to the new filter before exiting to userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-3-mhal@rbox.co [sean: expand changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
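The resulting ordering, in outline (a sketch of the relevant lines only; kvm_free_msr_filter() is assumed to be the existing teardown helper in x86.c):

  mutex_lock(&kvm->lock);
  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
                                   mutex_is_locked(&kvm->lock));
  mutex_unlock(&kvm->lock);       /* drop kvm->lock before the grace period */

  /* Runs unlocked; waits for all vCPUs to stop using the old filter. */
  synchronize_srcu(&kvm->srcu);
  kvm_free_msr_filter(old_filter);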
2023-02-03  KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_SET_PMU_EVENT_FILTER)  (Michal Luczaj)
Reduce time spent holding kvm->lock: unlock the mutex before calling synchronize_srcu_expedited(). There is no need to hold kvm->lock until all vCPUs have been kicked; KVM only needs to guarantee that all vCPUs will switch to the new filter before exiting to userspace. Protecting the write to __reprogram_pmi is also unnecessary, as a vCPU may process a set bit before receiving the final KVM_REQ_PMU, but the per-vCPU writes are guaranteed to occur after all vCPUs have switched to the new filter. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-2-mhal@rbox.co [sean: expand changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03  KVM: x86/emulator: Fix comment in __load_segment_descriptor()  (Michal Luczaj)
The comment refers to the same condition twice. Make it reflect what the code actually does. No functional change intended. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230126013405.2967156-3-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03  KVM: x86/emulator: Fix segment load privilege level validation  (Michal Luczaj)
The Intel SDM describes the steps the CPU takes to verify that a memory segment can actually be used at a given privilege level. Loading DS/ES/FS/GS involves checking the segment's type as well as making sure that neither the selector's RPL nor the caller's CPL is greater than the segment's DPL.

The emulator implements Intel's pseudocode in __load_segment_descriptor(), even quoting the pseudocode in the comments. Although the pseudocode is correctly translated, the implementation is incorrect. This is most likely because the SDM itself was wrong at the time. The patch fixes the emulator's logic and updates the pseudocode in the comment. Below are historical notes.

Emulator code for handling segment descriptors appears to have been introduced in March 2010 in commit 38ba30ba51a0 ("KVM: x86 emulator: Emulate task switch in emulator.c"). Intel SDM Vol 2A: Instruction Set Reference, A-M (Order Number: 253666-034US, _March 2010_) lists the steps for loading segment registers in the section on the MOV instruction:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
       or segment is not a data or readable code segment
       or ((segment is a data or nonconforming code segment)
       and (both RPL and CPL > DPL))                          <---
    THEN #GP(selector); FI;

This is precisely what __load_segment_descriptor() quotes and implements. But there's a twist; a few SDM revisions later (253667-044US), in August 2012, the snippet above becomes:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
       or segment is not a data or readable code segment
       or ((segment is a data or nonconforming code segment)
       [note: missing or superfluous parenthesis?]
       or ((RPL > DPL) and (CPL > DPL))                       <---
    THEN #GP(selector); FI;

Many SDMs later (253667-065US), in December 2017, the pseudocode reaches what seems to be its final form:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
       OR segment is not a data or readable code segment
       OR ((segment is a data or nonconforming code segment)
       AND ((RPL > DPL) or (CPL > DPL)))                      <---
    THEN #GP(selector); FI;

which also matches the behavior described in AMD's APM, which states that a #GP occurs if:

  The DS, ES, FS, or GS register was loaded and the segment pointed to
  was a data or non-conforming code segment, but the RPL or CPL was
  greater than the DPL.

Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230126013405.2967156-2-mhal@rbox.co [sean: add blurb to changelog calling out AMD agrees] Signed-off-by: Sean Christopherson <seanjc@google.com>
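A sketch of the corrected check for the DS/ES/FS/GS case in __load_segment_descriptor(); the descriptor type bit tests approximate the upstream code:

  /*
   * #GP if the segment is not a data or readable code segment, or if it
   * is a data or nonconforming code segment and either RPL or CPL is
   * greater than DPL.  The old code required *both* RPL and CPL to
   * exceed DPL, i.e. used '&&' where '||' belongs.
   */
  if ((seg_desc.type & 0xa) == 0x8 ||
      (((seg_desc.type & 0xc) != 0xc) &&
       (rpl > dpl || cpl > dpl)))
          goto exception;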
2023-02-02  scripts/spelling.txt: add `permitted'  (Ricardo Ribalda)
Patch series "spelling: Fix some trivial typos". Seems like permitted has two t's :). Let's add that to spelling.txt to help others. This patch (of 3): Add another common typo. Noticed when I sent a patch with the typo; the same misspelling also shows up in kvm and of. [ribalda@chromium.org: fix trivial typo] Link: https://lkml.kernel.org/r/20221220-permited-v1-2-52ea9857fa61@chromium.org Link: https://lkml.kernel.org/r/20221220-permited-v1-1-52ea9857fa61@chromium.org Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-01  KVM: x86/pmu: Add PDIR++ and PDist support for SPR and later models  (Like Xu)
The PEBS capability on SPR is basically the same as on Ice Lake Server, with the exception of two special facilities that have been enhanced and require special handling.

Upon triggering a PEBS assist, there will be a finite delay between the time the counter overflows and when the microcode starts to carry out its data collection obligations. Even if the delay is constant in core clock space, it invariably manifests as variable "skids" in instruction address space.

On Ice Lake Server, the Precise Distribution of Instructions Retired (PDIR) facility mitigates the "skid" problem by providing an early indication of when the counter is about to overflow. On SPR, the available PDIR counter (Fixed 0) is unchanged, but the capability is enhanced to Instruction-Accurate PDIR (PDIR++), where PEBS is taken on the next instruction after the one that caused the overflow.

SPR also introduces a new Precise Distribution (PDist) facility, only on general programmable counter 0. Per the Intel SDM, PDist eliminates any skid or shadowing effects from PEBS. With PDist, the PEBS record will be generated precisely upon completion of the instruction or operation that causes the counter to overflow (there is no "wait for next occurrence" by default).

In terms of KVM handling, when the guest accesses those special counters, KVM needs to request the same-index counters via the perf_event kernel subsystem to ensure that the guest uses the correct PEBS hardware counter (PDIR++ or PDist). This is mainly achieved by adjusting the event precise level to the maximum, where the semantics of this magic number are mainly defined by the internal software context of perf_event, and it's also backwards compatible as part of the user space interface.

Opportunistically, refine the confusing comments on TNT+, as the only ones that currently support pebs_ept are Ice Lake Server and SPR (GLC+).

Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20221109082802.27543-3-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01  KVM: x86: Reinitialize xAPIC ID when userspace forces x2APIC => xAPIC  (Emanuele Giuseppe Esposito)
Reinitialize the xAPIC ID to the vCPU ID when userspace forces the APIC to transition directly from x2APIC to xAPIC mode, e.g. to emulate RESET. KVM already stuffs the xAPIC ID when the APIC is transitioned from DISABLED to xAPIC (commit 49bd29ba1dbd ("KVM: x86: reset APIC ID when enabling LAPIC")), i.e. userspace is conditioned to expect KVM to update the xAPIC ID, but KVM doesn't handle the architecturally-impossible case where userspace forces x2APIC=>xAPIC via KVM_SET_MSRS.

On its own, the "bug" is benign, as userspace emulation of RESET will also stuff APIC registers via KVM_SET_LAPIC, i.e. will manually set the xAPIC ID. However, commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") introduced a bug, fixed by commit ef40757743b4 ("KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself"), that caused KVM to fail to properly update the xAPIC ID when handling KVM_SET_LAPIC. Refresh the xAPIC ID even though it's not strictly necessary so that KVM provides consistent behavior.

Note, KVM follows Intel architecture with regard to handling the xAPIC ID and x2APIC IDs across mode transitions. For the APIC DISABLED case (commit 49bd29ba1dbd), Intel's SDM says the xAPIC ID _may_ be reinitialized:

  10.4.3 Enabling or Disabling the Local APIC
  When IA32_APIC_BASE[11] is set to 0, prior initialization to the APIC may be
  lost and the APIC may return to the state described in Section 10.4.7.1,
  “Local APIC State After Power-Up or Reset.”

  10.4.7.1 Local APIC State After Power-Up or Reset
  ... The local APIC ID register is set to a unique APIC ID. ...

i.e. KVM's behavior is legal as per Intel's architecture. In practice, Intel's behavior is N/A as modern Intel CPUs (since at least Haswell) make the xAPIC ID fully read-only.

And for xAPIC => x2APIC transitions (commit 257b9a5faab5 ("KVM: x86: use correct APIC ID on x2APIC transition")), Intel's SDM says:

  Any APIC ID value written to the memory-mapped local APIC ID register is
  not preserved.

AMD's APM says nothing (that I could find) about the xAPIC ID when the APIC is DISABLED, but testing on bare metal (Rome) shows that the xAPIC ID is preserved when the APIC is DISABLED and re-enabled in xAPIC mode. AMD also preserves the xAPIC ID when the APIC is transitioned from xAPIC to x2APIC, i.e. allows a backdoor write of the x2APIC ID, which is again not emulated by KVM.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Link: https://lore.kernel.org/all/20230109130605.2013555-2-eesposit@redhat.com [sean: rewrite changelog, set xAPIC ID iff APIC is enabled] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01  KVM: x86: hyper-v: Add extended hypercall support in Hyper-v  (Vipin Sharma)
Add support for extended hypercalls in Hyper-V. Hyper-V TLFS 6.0b describes hypercalls above call code 0x8000 as extended hypercalls. A Hyper-V guest VM discovers the availability of extended hypercalls via CPUID.0x40000003.EBX BIT(20). If the bit is set then the guest can call extended hypercalls.

All extended hypercalls will exit to userspace by default. This allows for easy support of future hypercalls without being dependent on KVM releases. If there is a need to process a hypercall in KVM instead of userspace, then KVM can create a capability which userspace can query to know which hypercalls can be handled by KVM and enable handling of those hypercalls.

Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20221212183720.4062037-10-vipinsh@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
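In outline, the dispatch in kvm_hv_hypercall() gains a case along these lines. This is a hedged sketch: HV_EXT_CALL_QUERY_CAPABILITIES is the first extended call code (0x8001), hypercall_userspace_exit is the common exit label from the companion patch below, and the CPUID-based access check is elided:

  default:
          if (hc.code >= HV_EXT_CALL_QUERY_CAPABILITIES) {
                  /*
                   * Extended hypercall (call code >= 0x8000): punt to
                   * userspace via the KVM_EXIT_HYPERV_HCALL exit, assuming
                   * the guest was given CPUID.0x40000003.EBX bit 20.
                   */
                  goto hypercall_userspace_exit;
          }
          ret = HV_STATUS_INVALID_HYPERCALL_CODE;
          break;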
2023-02-01  KVM: x86: hyper-v: Use common code for hypercall userspace exit  (Vipin Sharma)
Remove duplicate code to exit to userspace for hyper-v hypercalls and use a common place to exit. No functional change intended. Signed-off-by: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20221212183720.4062037-9-vipinsh@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31  KVM: x86: Use emulator callbacks instead of duplicating "host flags"  (Maxim Levitsky)
Instead of re-defining the "host flags" bits, just expose dedicated helpers for each of the two remaining flags that are consumed by the emulator. The emulator never consumes both "is guest" and "is SMM" in close proximity, so there is no motivation to avoid additional indirect branches. Also while at it, garbage collect the recently removed host flags. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20221129193717.513824-6-mlevitsk@redhat.com [sean: fix CONFIG_KVM_SMM=n builds, tweak names of wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31  KVM: x86: Move HF_NMI_MASK and HF_IRET_MASK into "struct vcpu_svm"  (Maxim Levitsky)
Move HF_NMI_MASK and HF_IRET_MASK (a.k.a. "waiting for IRET") out of the common "hflags" and into dedicated flags in "struct vcpu_svm". The flags are used only by SVM and thus should not be in hflags. Tracking NMI masking in software isn't SVM specific, e.g. VMX has a similar flag (soft_vnmi_blocked), but that's much more of a hack as VMX can't intercept IRET, is useful only for ancient CPUs, i.e. will hopefully be removed at some point, and again the exact behavior is vendor specific and shouldn't ever be referenced in common code. No functional change is intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com [sean: split from HF_GIF_MASK patch] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31  KVM: x86: Move HF_GIF_MASK into "struct vcpu_svm" as "guest_gif"  (Maxim Levitsky)
Move HF_GIF_MASK out of the common "hflags" and into vcpu_svm.guest_gif. GIF is an SVM-only concept and should never be consulted outside of SVM-specific code. No functional change is intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com [sean: split to separate patch] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31  KVM: nSVM: Don't sync tlb_ctl back to vmcb12 on nested VM-Exit  (Maxim Levitsky)
Don't sync the TLB control field from vmcb02 back to vmcb12 on nested VM-Exit. Per AMD's APM, the field is not modified by hardware: "The VMRUN instruction reads, but does not change, the value of the TLB_CONTROL field." Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20221129193717.513824-2-mlevitsk@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26  KVM: x86/pmu: Provide "error" semantics for unsupported-but-known PMU MSRs  (Sean Christopherson)
Provide "error" semantics (read zeros, drop writes) for userspace accesses to MSRs that are ultimately unsupported for whatever reason, but for which KVM told userspace to save and restore the MSR, i.e. for MSRs that KVM included in KVM_GET_MSR_INDEX_LIST. Previously, KVM special cased a few PMU MSRs that were problematic at one point or another. Extend the treatment to all PMU MSRs, e.g. to avoid spurious unsupported accesses. Note, the logic can also be used for non-PMU MSRs, but as of today only PMU MSRs can end up being unsupported after KVM told userspace to save and restore them. Link: https://lore.kernel.org/r/20230124234905.3774678-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26  KVM: x86/pmu: Don't tell userspace to save MSRs for non-existent fixed PMCs  (Like Xu)
Limit the set of MSRs for fixed PMU counters based on the number of fixed counters actually supported by the host so that userspace doesn't waste time saving and restoring dummy values. Signed-off-by: Like Xu <likexu@tencent.com> [sean: split for !enable_pmu logic, drop min(), write changelog] Link: https://lore.kernel.org/r/20230124234905.3774678-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26  KVM: x86/pmu: Don't tell userspace to save PMU MSRs if PMU is disabled  (Sean Christopherson)
Omit all PMU MSRs from the "MSRs to save" list if the PMU is disabled so that userspace doesn't waste time saving and restoring dummy values. KVM provides "error" semantics (read zeros, drop writes) for such known-but-unsupported MSRs, i.e. has fudged around this issue for quite some time. Keep the "error" semantics as-is for now; the logic will be cleaned up in a separate patch. Cc: Aaron Lewis <aaronlewis@google.com> Cc: Weijiang Yang <weijiang.yang@intel.com> Link: https://lore.kernel.org/r/20230124234905.3774678-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26  KVM: x86/pmu: Use separate array for defining "PMU MSRs to save"  (Sean Christopherson)
Move all potential to-be-saved PMU MSRs into a separate array so that a future patch can easily omit all PMU MSRs from the list when the PMU is disabled. No functional change intended. Link: https://lore.kernel.org/r/20230124234905.3774678-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26  KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs  (Sean Christopherson)
Add helpers to print unimplemented MSR accesses and condition all such prints on report_ignored_msrs, i.e. honor userspace's request to not print unimplemented MSRs. Even though vcpu_unimpl() is ratelimited, printing can still be problematic, e.g. if a print gets stalled when host userspace is writing MSRs during live migration, an effective stall can result in very noticeable disruption in the guest. E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU counters while the PMU was disabled in KVM.

  -   99.75%  0.00%  [.] __ioctl
     - __ioctl
        - 99.74% entry_SYSCALL_64_after_hwframe
             do_syscall_64
             sys_ioctl
           - do_vfs_ioctl
              - 92.48% kvm_vcpu_ioctl
                 - kvm_arch_vcpu_ioctl
                    - 85.12% kvm_set_msr_ignored_check
                         svm_set_msr
                         kvm_set_msr_common
                         printk
                         vprintk_func
                         vprintk_default
                         vprintk_emit
                         console_unlock
                         call_console_drivers
                         univ8250_console_write
                         serial8250_console_write
                         uart_console_write

Reported-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
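The helpers are small wrappers in the spirit of the following sketch (treat the exact names, signatures, and format strings as approximate):

  static inline void kvm_pr_unimpl_wrmsr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
          if (report_ignored_msrs)
                  vcpu_unimpl(vcpu, "Unhandled WRMSR(0x%x) = 0x%llx\n", msr, data);
  }

  static inline void kvm_pr_unimpl_rdmsr(struct kvm_vcpu *vcpu, u32 msr)
  {
          if (report_ignored_msrs)
                  vcpu_unimpl(vcpu, "Unhandled RDMSR(0x%x)\n", msr);
  }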
2023-01-26  KVM: x86/pmu: Cap kvm_pmu_cap.num_counters_gp at KVM's internal max  (Sean Christopherson)
Limit kvm_pmu_cap.num_counters_gp during kvm_init_pmu_capability() based on the vendor PMU capabilities so that consuming num_counters_gp naturally does the right thing. This fixes a mostly theoretical bug where KVM could over-report its PMU support in KVM_GET_SUPPORTED_CPUID for leaf 0xA, e.g. if the number of counters reported by perf is greater than KVM's hardcoded internal limit. Incorporating input from the AMD PMU also avoids over-reporting MSRs to save when running on AMD. Link: https://lore.kernel.org/r/20230124234905.3774678-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
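Conceptually, kvm_init_pmu_capability() clamps the perf-reported value against the vendor's hardcoded maximum, roughly (the field name on kvm_pmu_ops is assumed from the patch description):

  kvm_pmu_cap.num_counters_gp = min(kvm_pmu_cap.num_counters_gp,
                                    pmu_ops->MAX_NR_GP_COUNTERS);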
2023-01-26  KVM: x86/pmu: Drop event_type and rename "struct kvm_event_hw_type_mapping"  (Like Xu)
After commit 02791a5c362b ("KVM: x86/pmu: Use PERF_TYPE_RAW to merge reprogram_{gp,fixed}counter()"), the vPMU directly uses the hardware event's eventsel and unit_mask to reprogram the perf_event, and the event_type field in "struct kvm_event_hw_type_mapping" is simply no longer used. Convert the struct into an anonymous struct, as the current name is obsolete now that the structure no longer has any mapping semantics, and placing the struct definition directly above its sole user makes it easier to understand what the array is filling in. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20221205122048.16023-1-likexu@tencent.com [sean: drop new comment, use anonymous struct] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-25  KVM: x86: Propagate the AMD Automatic IBRS feature to the guest  (Kim Phillips)
Add the AMD Automatic IBRS feature bit to those being propagated to the guest, and enable the guest EFER bit. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-9-kim.phillips@amd.com
2023-01-25  x86/cpu, kvm: Add the SMM_CTL MSR not present feature  (Kim Phillips)
The SMM_CTL MSR not present feature was being open-coded for KVM. Add it to its newly added CPUID leaf 0x80000021 EAX proper. Also drop the bit description comments now the code is more self-describing. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-7-kim.phillips@amd.com
2023-01-25  x86/cpu, kvm: Add the Null Selector Clears Base feature  (Kim Phillips)
The Null Selector Clears Base feature was being open-coded for KVM. Add it to its newly added native CPUID leaf 0x80000021 EAX proper. Also drop the bit description comments now it's more self-describing. [ bp: Convert test in check_null_seg_clears_base() too. ] Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-6-kim.phillips@amd.com
2023-01-25  x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf  (Kim Phillips)
The LFENCE always serializing feature bit was defined as scattered LFENCE_RDTSC and its native leaf bit position open-coded for KVM. Add it to its newly added CPUID leaf 0x80000021 EAX proper. With LFENCE_RDTSC in its proper place, the kernel's set_cpu_cap() will effectively synthesize the feature for KVM going forward. Also, DE_CFG[1] doesn't need to be set on such CPUs anymore. [ bp: Massage and merge diff from Sean. ] Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-5-kim.phillips@amd.com
2023-01-25  x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature  (Kim Phillips)
The "Processor ignores nested data breakpoints" feature was being open-coded for KVM. Add the feature to its newly introduced CPUID leaf 0x80000021 EAX proper. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-4-kim.phillips@amd.com
2023-01-25  KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code  (Kim Phillips)
Move code from __do_cpuid_func() to kvm_set_cpu_caps() in preparation for adding the features in their native leaf. Also drop the bit description comments as it will be more self-describing once the individual features are added. Whilst there, switch to using the more efficient cpu_feature_enabled() instead of static_cpu_has(). Note, LFENCE_RDTSC and "NULL selector clears base" are currently synthetic, Linux-defined feature flags as Linux tracking of the features predates AMD's definition. Keep the manual propagation of the flags from their synthetic counterparts until the kernel fully converts to AMD's definition, otherwise KVM would stop synthesizing the flags as intended. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-3-kim.phillips@amd.com
2023-01-25  x86/cpu, kvm: Add support for CPUID_80000021_EAX  (Kim Phillips)
Add support for CPUID leaf 80000021, EAX. The majority of the features will be used in the kernel and thus a separate leaf is appropriate. Include KVM's reverse_cpuid entry because features are used by VM guests, too. [ bp: Massage commit message. ] Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-2-kim.phillips@amd.com
2023-01-24  KVM: x86: Replace IS_ERR() with IS_ERR_VALUE()  (ye xingchen)
Avoid type casts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Link: https://lore.kernel.org/r/202211161718436948912@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
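A generic illustration of the pattern (not the exact call site); get_hva() is a hypothetical helper returning an address-or-errno in an unsigned long:

  unsigned long hva = get_hva();          /* hypothetical helper */

  if (IS_ERR_VALUE(hva))                  /* was: IS_ERR((void *)hva) */
          return (long)hva;               /* was: PTR_ERR((void *)hva) */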
2023-01-24  KVM: VMX: Handle NMI VM-Exits in noinstr region  (Sean Christopherson)
Move VMX's handling of NMI VM-Exits into vmx_vcpu_enter_exit() so that the NMI is handled prior to leaving the safety of noinstr. Handling the NMI after leaving noinstr exposes the kernel to potential ordering problems as an instrumentation-induced fault, e.g. #DB, #BP, #PF, etc. will unblock NMIs when IRETing back to the faulting instruction. Reported-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: VMX: Provide separate subroutines for invoking NMI vs. IRQ handlers  (Sean Christopherson)
Split the asm subroutines for handling NMIs versus IRQs that occur in the guest so that the NMI handler can be called from a noinstr section. As a bonus, the NMI path doesn't need an indirect branch. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  x86/entry: KVM: Use dedicated VMX NMI entry for 32-bit kernels too  (Sean Christopherson)
Use a dedicated entry for invoking the NMI handler from KVM VMX's VM-Exit path for 32-bit even though using a dedicated entry for 32-bit isn't strictly necessary. Exposing a single symbol will allow KVM to reference the entry point in assembly code without having to resort to more #ifdefs (or #defines). idtentry.h is intended to be included from asm files only once, and so simply including idtentry.h in KVM assembly isn't an option. Bypassing the ESP fixup and CR3 switching in the standard NMI entry code is safe as KVM always handles NMIs that occur in the guest on a kernel stack, with a kernel CR3. Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: VMX: Always inline to_vmx() and to_kvm_vmx()  (Sean Christopherson)
Tag to_vmx() and to_kvm_vmx() __always_inline as they both just reflect the passed in pointer (the embedded struct is the first field in the container), and drop the @vmx param from vmx_vcpu_enter_exit(), which likely existed purely to make noinstr validation happy. Amusingly, when the compiler decides not to inline the helpers, e.g. for KASAN builds, to_vmx() and to_kvm_vmx() may end up pointing at the same symbol, which generates very confusing objtool warnings. E.g. the use of to_vmx() in a future patch led to objtool complaining about to_kvm_vmx(), and only once all use of to_kvm_vmx() was commented out did to_vmx() pop up in the objtool report. vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x160: call to to_kvm_vmx() leaves .noinstr.text section Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
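The helpers themselves are trivial container_of() wrappers; after the change they look roughly like:

  static __always_inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
  {
          return container_of(vcpu, struct vcpu_vmx, vcpu);
  }

  static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
  {
          return container_of(kvm, struct kvm_vmx, kvm);
  }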
2023-01-24  KVM: VMX: Always inline eVMCS read/write helpers  (Sean Christopherson)
Tag all evmcs_{read,write}() helpers __always_inline so that they can be freely used in noinstr sections, e.g. to get the VM-Exit reason in vcpu_vmx_enter_exit() (in a future patch). For consistency and to avoid more spot fixes in the future, e.g. see commit 010050a86393 ("x86/kvm: Always inline evmcs_write64()"), tag all accessors even though evmcs_read32() is the only anticipated use case in the near future. In practice, non-KASAN builds are all but guaranteed to inline the helpers anyways. vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x107: call to evmcs_read32() leaves .noinstr.text section Reported-by: kernel test robot <lkp@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: VMX: Allow VM-Fail path of VMREAD helper to be instrumented  (Sean Christopherson)
Allow instrumentation in the VM-Fail path of __vmcs_readl() so that the helper can be used in noinstr functions, e.g. to get the exit reason in vmx_vcpu_enter_exit() in order to handle NMI VM-Exits in the noinstr section. While allowing instrumentation isn't technically safe, KVM has much bigger problems if VMREAD fails in a noinstr section. Note, all other VMX instructions also allow instrumentation in their VM-Fail paths for similar reasons, VMREAD was simply omitted by commit 3ebccdf373c2 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text") because VMREAD wasn't used in a noinstr section at the time. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: x86: Make vmx_get_exit_qual() and vmx_get_intr_info() noinstr-friendly  (Sean Christopherson)
Add an extra special noinstr-friendly helper to test+mark a "register" available and use it when caching vmcs.EXIT_QUALIFICATION and vmcs.VM_EXIT_INTR_INFO. Make the caching helpers __always_inline too so that they can be used in noinstr functions. A future fix will move VMX's handling of NMI exits into the noinstr vmx_vcpu_enter_exit() so that the NMI is processed before any kind of instrumentation can trigger a fault and thus IRET, i.e. so that KVM doesn't invoke the NMI handler with NMIs enabled. Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: VMX: don't use "unsigned long" in vmx_vcpu_enter_exit()  (Alexey Dobriyan)
__vmx_vcpu_run_flags() returns "unsigned int" and uses only 2 bits of it so using "unsigned long" is very much pointless. Furthermore, __vmx_vcpu_run() and vmx_spec_ctrl_restore_host() take an "unsigned int", i.e. actually relying on an "unsigned long" value won't work. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/Y3e7UW0WNV2AZmsZ@p183 Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: VMX: Access @flags as a 32-bit value in __vmx_vcpu_run()  (Sean Christopherson)
Access @flags using 32-bit operands when saving and testing @flags for VMX_RUN_VMRESUME, as using 8-bit operands is unnecessarily fragile due to relying on VMX_RUN_VMRESUME being in bits 0-7. The behavior of treating @flags as a single byte is a holdover from when the param was "bool launched", i.e. is not deliberate. Cc: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20221119003747.2615229-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: SVM: Account scratch allocations used to decrypt SEV guest memory  (Anish Ghulati)
Account the temp/scratch allocation used to decrypt unaligned debug accesses to SEV guest memory; the allocation is very much tied to the target VM. Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Anish Ghulati <aghulati@google.com> Link: https://lore.kernel.org/r/20230113220923.2834699-1-aghulati@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: svm/avic: Drop "struct kvm_x86_ops" for avic_hardware_setup()  (Like Xu)
Even in commit 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support"), where avic_hardware_setup() was first introduced, its only parameter, "struct kvm_x86_ops *ops", was not used at all. Clean it up a bit to avoid compiler complaints from the LLVM toolchain. Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221109115952.92816-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: SVM: remove redundant ret variable  (zhang songyi)
Return the value from svm_nmi_blocked() directly instead of storing it in a redundant local variable. Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Link: https://lore.kernel.org/r/202211282003389362484@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: x86/pmu: Introduce masked events to the pmu event filter  (Aaron Lewis)
When building a list of filter events, it can sometimes be a challenge to fit all the events needed to adequately restrict the guest into the limited space available in the pmu event filter. This stems from the fact that the pmu event filter requires each event (i.e. event select + unit mask) be listed, when the intention might be to restrict the event select altogether, regardless of its unit mask. Instead of increasing the number of filter events in the pmu event filter, add a new encoding that is able to do a more generalized match on the unit mask.

Introduce masked events as another encoding the pmu event filter understands. Masked events have the fields: mask, match, and exclude. When filtering based on these events, the mask is applied to the guest's unit mask to see if it matches the match value (i.e. umask & mask == match). The exclude bit can then be used to exclude events from that match. E.g. for a given event select, if it's easier to say which unit mask values shouldn't be filtered, a masked event can be set up to match all possible unit mask values, then another masked event can be set up to match the unit mask values that shouldn't be filtered.

Userspace can query to see if this feature exists by looking for the capability KVM_CAP_PMU_EVENT_MASKED_EVENTS. This feature is enabled by setting the flags field in the pmu event filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS. Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY(). It is an error to have a bit set outside the valid bits for a masked event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in such cases, including the high bits of the event select (35:32) if called on Intel.

With these updates the filter matching code has been updated to match on a common event. Masked events were flexible enough to handle both event types, so they were used as the common event. This changes how guest events get filtered because, regardless of the type of event used in the uAPI, they will be converted to masked events. Because of this there could be a slight performance hit: instead of matching the filter event with a lookup on event select + unit mask, it does a lookup on event select then walks the unit masks to find the match. This shouldn't be a big problem because I would expect the set of common event selects to be small, and if they aren't, the set can likely be reduced by using masked events to generalize the unit mask. Using one type of event when filtering guest events allows for a common code path to be used.

Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
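A userspace-side sketch of the encoding described above, allowing every unit mask of event select 0xc0 except 0x01. It assumes the KVM_PMU_ENCODE_MASKED_ENTRY() macro and flag from the series' uAPI header, with the argument order taken as (event_select, mask, match, exclude):

  struct kvm_pmu_event_filter *f;
  size_t sz = sizeof(*f) + 2 * sizeof(__u64);

  f = calloc(1, sz);
  f->action  = KVM_PMU_EVENT_ALLOW;
  f->flags   = KVM_PMU_EVENT_FLAG_MASKED_EVENTS;
  f->nevents = 2;
  /* mask 0x00 / match 0x00: any unit mask of event 0xc0 matches... */
  f->events[0] = KVM_PMU_ENCODE_MASKED_ENTRY(0xc0, 0x00, 0x00, false);
  /* ...but carve out unit mask 0x01 (mask 0xff, match 0x01, exclude). */
  f->events[1] = KVM_PMU_ENCODE_MASKED_ENTRY(0xc0, 0xff, 0x01, true);

  ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, f);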
2023-01-24  KVM: x86/pmu: prepare the pmu event filter for masked events  (Aaron Lewis)
Refactor check_pmu_event_filter() in preparation for masked events. No functional changes intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-4-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: x86/pmu: Remove impossible events from the pmu event filter  (Aaron Lewis)
If it's not possible for an event in the pmu event filter to match a pmu event being programmed by the guest, it's pointless to have it in the list. Opt for a shorter list by removing those events. Because this is established uAPI, the pmu event filter can't outright reject these events as garbage and return an error. Instead, play nice and remove them from the list. Also, opportunistically rewrite the comment when the filter is set to clarify that it guards against *all* TOCTOU attacks on the verified data. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-3-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: x86/pmu: Correct the mask used in a pmu event filter lookup  (Aaron Lewis)
When checking if a pmu event the guest is attempting to program should be filtered, only consider the event select + unit mask in that decision. Use an architecture specific mask to mask out all other bits, including bits 35:32 on Intel. Those bits are not part of the event select and should not be considered in that decision. Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-2-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
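Illustratively, the comparison ends up masked to just the event select and unit mask fields; for Intel that looks roughly like the following (standard masks from asm/perf_event.h, arch-specific details approximated):

  #include <asm/perf_event.h>

  static bool filter_event_matches(u64 filter_eventsel, u64 guest_eventsel)
  {
          /*
           * Only eventsel[7:0] and umask[15:8] participate on Intel;
           * USR/OS/EN/INT/CMASK bits, and bits 35:32, are ignored.
           */
          const u64 mask = ARCH_PERFMON_EVENTSEL_EVENT |
                           ARCH_PERFMON_EVENTSEL_UMASK;

          return (filter_eventsel & mask) == (guest_eventsel & mask);
  }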
2023-01-24  KVM: x86/mmu: Use kstrtobool() instead of strtobool()  (Christophe JAILLET)
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/670882aa04dbdd171b46d3b20ffab87158454616.1673689135.git.christophe.jaillet@wanadoo.fr Signed-off-by: Sean Christopherson <seanjc@google.com>
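The conversion is mechanical; a module-param setter in mmu.c ends up in the shape of the following (a generic sketch, not the exact upstream function):

  #include <linux/kstrtox.h>

  static int set_bool_param(const char *val, const struct kernel_param *kp)
  {
          bool enable;

          /* was: strtobool(val, &enable) -- same semantics, preferred name */
          if (kstrtobool(val, &enable))
                  return -EINVAL;

          *(bool *)kp->arg = enable;
          return 0;
  }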
2023-01-24  KVM: x86/mmu: Cleanup range-based flushing for given page  (Hou Wenlong)
Use the new kvm_flush_remote_tlbs_gfn() helper to clean up the call sites of range-based flushing for a given page, which makes the code clearer. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/593ee1a876ece0e819191c0b23f56b940d6686db.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: x86/mmu: Fix wrong gfn range of tlb flushing in validate_direct_spte()  (Hou Wenlong)
The spte pointing to the child SP is dropped, so the whole gfn range covered by the child SP should be flushed. Although Hyper-V may treat a 1-page flush the same when the address points to a huge page, it would still be better to use the correct huge page size. Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/5f297c566f7d7ff2ea6da3c66d050f69ce1b8ede.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24  KVM: x86/mmu: Fix wrong start gfn of tlb flushing with range  (Hou Wenlong)
When a spte is dropped, the start gfn of tlb flushing should be the gfn of the spte, not the base gfn of the SP that contains the spte. Also introduce a helper function to do range-based flushing when a spte is dropped, which helps prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such cases. Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/72ac2169a261976f00c1703e88cda676dfb960f5.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
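The new helper boils down to deriving the gfn from the SPTE's slot in its shadow page and flushing the range that SPTE maps, roughly as below (existing MMU accessors; the helper name follows the series, the body is approximate):

  static void kvm_flush_remote_tlbs_sptep(struct kvm *kvm, u64 *sptep)
  {
          struct kvm_mmu_page *sp = sptep_to_sp(sptep);
          gfn_t gfn = kvm_mmu_page_get_gfn(sp, spte_index(sptep));

          kvm_flush_remote_tlbs_with_address(kvm, gfn,
                                             KVM_PAGES_PER_HPAGE(sp->role.level));
  }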