summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx
AgeCommit message (Collapse)Author
2024-02-07KVM: VMX: Report up-to-date exit qualification to userspaceChao Gao
Use vmx_get_exit_qual() to read the exit qualification. vcpu->arch.exit_qualification is cached for EPT violation only and even for EPT violation, it is stale at this point because the up-to-date value is cached later in handle_ept_violation(). Fixes: 70bcd708dfd1 ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits") Signed-off-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-02KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrlMingwei Zhang
Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl when reprogramming fixed counters, as truncating the value results in KVM thinking fixed counter 2 is already disabled (the bug also affects fixed counters 3+, but KVM doesn't yet support those). As a result, if the guest disables fixed counter 2, KVM will get a false negative and fail to reprogram/disable emulation of the counter, which can leads to incorrect counts and spurious PMIs in the guest. Fixes: 76d287b2342e ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()") Cc: stable@vger.kernel.org Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com [sean: rewrite changelog to call out the effects of the bug] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Snapshot event selectors that KVM emulates in softwareSean Christopherson
Snapshot the event selectors for the events that KVM emulates in software, which is currently instructions retired and branch instructions retired. The event selectors a tied to the underlying CPU, i.e. are constant for a given platform even though perf doesn't manage the mappings as such. Getting the event selectors from perf isn't exactly cheap, especially if mitigations are enabled, as at least one indirect call is involved. Snapshot the values in KVM instead of optimizing perf as working with the raw event selectors will be required if KVM ever wants to emulate events that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id" entry. Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmapSean Christopherson
Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops that iterate over valid PMCs given a bitmap (and because seeing checkpatch whine about bad macro style is always amusing). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Move pmc_idx => pmc translation helper to common codeSean Christopherson
Add a common helper for *internal* PMC lookups, and delete the ops hook and Intel's implementation. Keep AMD's implementation, but rename it to amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for both KVM-internal and guest-initiated lookups. Because KVM tracks all counters in a single bitmap, getting a counter when iterating over a bitmap, e.g. of all valid PMCs, requires a small amount of math, that while simple, isn't super obvious and doesn't use the same semantics as PMC lookups from RDPMC! Although AMD doesn't support fixed counters, the common PMU code still behaves as if there a split, the high half of which just happens to always be empty. Opportunstically add a comment to explain both what is going on, and why KVM uses a single bitmap, e.g. the boilerplate for iterating over separate bitmaps could be done via macros, so it's not (just) about deduplicating code. Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Add common define to capture fixed counters offsetSean Christopherson
Add a common define to "officially" solidify KVM's split of counters, i.e. to commit to using bits 31:0 to track general purpose counters and bits 63:32 to track fixed counters (which only Intel supports). KVM already bleeds this behavior all over common PMU code, and adding a KVM- defined macro allows clarifying that the value is a _base_, as oppposed to the _flag_ that is used to access fixed PMCs via RDPMC (which perf confusingly calls INTEL_PMC_FIXED_RDPMC_BASE). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabledSean Christopherson
Move the purging of common PMU metadata from intel_pmu_refresh() to kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if the VM is supposed to have a vPMU. KVM already denies access to the PMU based on kvm->arch.enable_pmu, as get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already violates AMD's architecture by not virtualizing a PMU (kernels have long since learned to not panic when the PMU is unavailable). But configuring the PMU as if it were enabled causes unwanted side effects, e.g. calls to kvm_pmu_trigger_event() waste an absurd number of cycles due to the all_valid_pmc_idx bitmap being non-zero. Fixes: b1d66dad65dc ("KVM: x86/svm: Add module param to control PMU virtualization") Reported-by: Konstantin Khorenko <khorenko@virtuozzo.com> Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handlingXin Li
When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in IRQ/NMI induced VM exits. Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com
2024-01-30KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC typesSean Christopherson
Explicitly check for attempts to read unsupported PMC types instead of letting the bounds check fail. Functionally, letting the check fail is ok, but it's unnecessarily subtle and does a poor job of documenting the architectural behavior that KVM is emulating. Reviewed-by: Dapeng Mi  <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flagSean Christopherson
Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an explicit value, not a flag (minus one wart). While non-architectural PMUs do use bit 31 as a flag (for "fast" reads), architectural PMUs use the upper half of ECX to encode the type. From the SDM: ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type Note, that the known supported types are 4000H and 2000H, i.e. look a lot like flags, doesn't contradict the above statement that ECX[31:16] holds the type, at least not by any sane reading of the SDM. Keep the explicitly clearing of the FIXED "flag", as KVM subtly relies on that behavior to disallow unsupported types while allowing the correct indices for fixed counters. This wart will be cleaned up in short order. Opportunistically grab the per-type bitmask in the if-else blocks to eliminate the one-off usage of the local "fixed" bool. Reported-by: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUsSean Christopherson
Inject #GP on RDPMC if the "fast" flag is set for architectural Intel PMUs, i.e. if the PMU version is non-zero. Per Intel's SDM, and confirmed on bare metal, the "fast" flag is supported only for non-architectural PMUs, and is reserved for architectural PMUs. If the processor does not support architectural performance monitoring (CPUID.0AH:EAX[7:0]=0), ECX[30:0] specifies the index of the PMC to be read. Setting ECX[31] selects “fast” read mode if supported. In this mode, RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero. If the processor does support architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type. The following PMC types are currently defined: — General-purpose counters use type 0. The index x (to read IA32_PMCx) must be less than the value enumerated by CPUID.0AH.EAX[15:8] (thus ECX[15:8] must be zero). — Fixed-function counters use type 4000H. The index x (to read IA32_FIXED_CTRx) can be used if either CPUID.0AH.EDX[4:0] > x or CPUID.0AH.ECX[x] = 1 (thus ECX[15:5] must be 0). — Performance metrics use type 2000H. This type can be used only if IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type, the index in ECX[15:0] is implementation specific. Opportunistically WARN if KVM ever actually tries to complete RDPMC for a non-architectural PMU, and drop the non-existent "support" for fast RDPMC, as KVM doesn't support such PMUs, i.e. kvm_pmu_rdpmc() should reject the RDPMC before getting to the Intel code. Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Fixes: 67f4d4288c35 ("KVM: x86: rdpmc emulation checks the counter incorrectly") Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUsSean Christopherson
Move the handling of "fast" RDPMC instructions, which drop bits 63:32 of the count, to Intel. The "fast" flag, and all modifiers for that matter, are Intel-only and aren't supported by AMD. Opportunistically replace open coded bit crud with proper #defines, and add comments to try and disentangle the flags vs. values mess for non-architectural vs. architectural PMUs. Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad indexSean Christopherson
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all relevant functions to make it as clear as possible that the check is not a standard PMC index check. On Intel, the basic rule is that only invalid opcodes and privilege/permission/mode checks have priority over VM-Exit, i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a non-existent MSR as an example where VM-Exit has priority over #GP, and RDPMC is effectively just a variation of RDMSR. Manually testing on various Intel CPUs confirms this behavior, and the inverted priority was introduced for SVM compatibility, i.e. was not an intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have priority over VM-Exit. Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0 static call so as to provide a convenient location to document the difference between Intel and AMD, and to again try to make it as obvious as possible that the early check is a one-off thing, not a generic "is this PMC valid?" helper. Fixes: 8061252ee0d2 ("KVM: SVM: Add intercept checks for remaining twobyte instructions") Cc: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Get eventsel for fixed counters from perfSean Christopherson
Get the event selectors used to effectively request fixed counters for perf events from perf itself instead of hardcoding them in KVM and hoping that they match the underlying hardware. While fixed counters 0 and 1 use architectural events, as of ffbe4ab0beda ("perf/x86/intel: Extend the ref-cycles event to GP counters") fixed counter 2 (reference TSC cycles) may use a software-defined pseudo-encoding or a real hardware-defined encoding. Reported-by: Kan Liang <kan.liang@linux.intel.com> Closes: https://lkml.kernel.org/r/4281eee7-6423-4ec8-bb18-c6aeee1faf2c%40linux.intel.com Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Setup fixed counters' eventsel during PMU initializationSean Christopherson
Set the eventsel for all fixed counters during PMU initialization, the eventsel is hardcoded and consumed if and only if the counter is supported, i.e. there is no reason to redo the setup every time the PMU is refreshed. Configuring all KVM-supported fixed counter also eliminates a potential pitfall if/when KVM supports discontiguous fixed counters, in which case configuring only nr_arch_fixed_counters will be insufficient (ignoring the fact that KVM will need many other changes to support discontiguous fixed counters). Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodingsSean Christopherson
Drop KVM's enumeration of Intel's architectural event encodings, and instead open code the three encodings (of which only two are real) that KVM uses to emulate fixed counters. Now that KVM doesn't incorrectly enforce the availability of architectural encodings, there is no reason for KVM to ever care about the encodings themselves, at least not in the current format of an array indexed by the encoding's position in CPUID. Opportunistically add a comment to explain why KVM cares about eventsel values for fixed counters. Suggested-by: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Allow programming events that match unsupported arch eventsSean Christopherson
Remove KVM's bogus restriction that the guest can't program an event whose encoding matches an unsupported architectural event. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed using the architectural encoding. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the architectural event. Preventing the guest from counting events whose encoding happens to match an architectural event breaks existing functionality whenever Intel adds an architectural encoding that was *ever* used for a CPU that doesn't enumerate support for the architectural event, even if the encoding is for the exact same event! E.g. the architectural encoding for Top-Down Slots is 0x01a4. Broadwell CPUs, which do not support the Top-Down Slots architectural event, 0x01a4 is a valid, model-specific event. Denying guest usage of 0x01a4 if/when KVM adds support for Top-Down slots would break any Broadwell-based guest. Reported-by: Kan Liang <kan.liang@linux.intel.com> Closes: https://lore.kernel.org/all/2004baa6-b494-462c-a11f-8104ea152c6a@linux.intel.com Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event") Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30KVM: x86/pmu: Always treat Fixed counters as available when supportedSean Christopherson
Treat fixed counters as available when they are supported, i.e. don't silently ignore an enabled fixed counter just because guest CPUID says the associated general purpose architectural event is unavailable. KVM originally treated fixed counters as always available, but that got changed as part of a fix to avoid confusing REF_CPU_CYCLES, which does NOT map to an architectural event, with the actual architectural event used associated with bit 7, TOPDOWN_SLOTS. The commit justified the change with: If the event is marked as unavailable in the Intel guest CPUID 0AH.EBX leaf, we need to avoid any perf_event creation, whether it's a gp or fixed counter. but that justification doesn't mesh with reality. The Intel SDM uses "architectural events" to refer to both general purpose events (the ones with the reverse polarity mask in CPUID.0xA.EBX) and the events for fixed counters, e.g. the SDM makes statements like: Each of the fixed-function PMC can count only one architectural performance event. but the fact that fixed counter 2 (TSC reference cycles) doesn't have an associated general purpose architectural makes trying to apply the mask from CPUID.0xA.EBX impossible. Furthermore, the lack of enumeration for an architectural event in CPUID only means the CPU doesn't officially support the architectural encoding, i.e. it doesn't mean using the architectural encoding _won't_ work, it sipmly means there are no guarantees that it will work as expected. E.g. if KVM is running in a VM that advertises a fixed counters but not the corresponding architectural event encoding, and perf decides to use a general purpose counter instead of a fixed counter, odds are very good that the underlying hardware actually does support the architectrual encoding, and that programming the encoding will count the right thing. In other words, asking perf to count the event will probably work, whereas intentionally doing nothing is obviously guaranteed to fail. Note, at the time of the change, KVM didn't enforce hardware support, i.e. didn't prevent userspace from enumerating support in guest CPUID.0xA.EBX for architectural events that aren't supported in hardware. I.e. silently dropping the fixed counter didn't somehow protection against counting the wrong event, it just enforced guest CPUID. And practically speaking, this issue is almost certainly limited to running KVM on a funky virtual CPU model. No known real hardware has an asymmetric PMU where a fixed counter is supported but the associated architectural event is not. Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event") Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-10x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINEBreno Leitao
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-08Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 support for virtualizing Linear Address Masking (LAM) Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc.
2024-01-08Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 PMU changes for 6.8: - Fix a variety of bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
2024-01-08Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.8: - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds.
2024-01-03arch/x86: Fix typosBjorn Helgaas
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2023-12-07KVM: nVMX: Hide more stuff under CONFIG_KVM_HYPERVVitaly Kuznetsov
'hv_evmcs_vmptr'/'hv_evmcs_map'/'hv_evmcs' fields in 'struct nested_vmx' should not be used when !CONFIG_KVM_HYPERV, hide them when !CONFIG_KVM_HYPERV. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-16-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: nVMX: Introduce accessor to get Hyper-V eVMCS pointerVitaly Kuznetsov
There's a number of 'vmx->nested.hv_evmcs' accesses in nested.c, introduce 'nested_vmx_evmcs()' accessor to hide them all in !CONFIG_KVM_HYPERV case. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-15-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: nVMX: Introduce helpers to check if Hyper-V evmptr12 is valid/setVitaly Kuznetsov
In order to get rid of raw 'vmx->nested.hv_evmcs_vmptr' accesses when !CONFIG_KVM_HYPERV, introduce a pair of helpers: nested_vmx_is_evmptr12_valid() to check that eVMPTR points to a valid address. nested_vmx_is_evmptr12_valid() to check that eVMPTR either points to a valid address or is in 'pending' port-migration state (meaning it is supposed to be valid but the exact address wasn't acquired from guest's memory yet). No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20231205103630.1391318-14-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: x86: Make Hyper-V emulation optionalVitaly Kuznetsov
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be desirable to not compile it in to reduce module sizes as well as the attack surface. Introduce CONFIG_KVM_HYPERV option to make it possible. Note, there's room for further nVMX/nSVM code optimizations when !CONFIG_KVM_HYPERV, this will be done in follow-up patches. Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files are grouped together. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: nVMX: Move guest_cpuid_has_evmcs() to hyperv.hVitaly Kuznetsov
In preparation for making Hyper-V emulation optional, move Hyper-V specific guest_cpuid_has_evmcs() to hyperv.h. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20231205103630.1391318-12-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: nVMX: Split off helper for emulating VMCLEAR on Hyper-V eVMCSVitaly Kuznetsov
To avoid overloading handle_vmclear() with Hyper-V specific details and to prepare the code to making Hyper-V emulation optional, create a dedicated nested_evmcs_handle_vmclear() helper. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-9-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: x86: Introduce helper to handle Hyper-V paravirt TLB flush requestsVitaly Kuznetsov
As a preparation to making Hyper-V emulation optional, introduce a helper to handle pending KVM_REQ_HV_TLB_FLUSH requests. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-8-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: VMX: Split off hyperv_evmcs.{ch}Vitaly Kuznetsov
Some Enlightened VMCS related code is needed both by Hyper-V on KVM and KVM on Hyper-V. As a preparation to making Hyper-V emulation optional, create dedicated 'hyperv_evmcs.{ch}' files which are used by both. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-7-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: VMX: Split off vmx_onhyperv.{ch} from hyperv.{ch}Vitaly Kuznetsov
hyperv.{ch} is currently a mix of stuff which is needed by both Hyper-V on KVM and KVM on Hyper-V. As a preparation to making Hyper-V emulation optional, put KVM-on-Hyper-V specific code into dedicated files. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-4-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation contextVitaly Kuznetsov
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg' placement in 'struct kvm_hv' is unfortunate. As a preparation to making Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it under CONFIG_HYPERV. While on it, introduce hv_get_partition_assist_page() helper to allocate partition assist page. Move the comment explaining why we use a single page for all vCPUs from VMX and expand it a bit. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30KVM: x86/pmu: Update sample period in pmc_write_counter()Sean Christopherson
Update a PMC's sample period in pmc_write_counter() to deduplicate code across all callers of pmc_write_counter(). Opportunistically move pmc_write_counter() into pmc.c now that it's doing more work. WRMSR isn't such a hot path that an extra CALL+RET pair will be problematic, and the order of function definitions needs to be changed anyways, i.e. now is a convenient time to eat the churn. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30KVM: x86/pmu: Move PMU reset logic to common x86 codeSean Christopherson
Move the common (or at least "ignored") aspects of resetting the vPMU to common x86 code, along with the stop/release helpers that are no used only by the common pmu.c. There is no need to manually handle fixed counters as all_valid_pmc_idx tracks both fixed and general purpose counters, and resetting the vPMU is far from a hot path, i.e. the extra bit of overhead to the PMC from the index is a non-issue. Zero fixed_ctr_ctrl in common code even though it's Intel specific. Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed counters via all_valid_pmc_idx, but not clearing the associated control bits, would be odd/confusing. Make the .reset() hook optional as SVM no longer needs vendor specific handling. Cc: stable@vger.kernel.org Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30KVM: SVM,VMX: Use %rip-relative addressing to access kvm_rebootingUros Bizjak
Instruction with %rip-relative address operand is one byte shorter than its absolute address counterpart and is also compatible with position independent executable (-fpie) build. No functional changes intended. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28KVM: x86: Use KVM-governed feature framework to track "LAM enabled"Binbin Wu
Use the governed feature framework to track if Linear Address Masking (LAM) is "enabled", i.e. if LAM can be used by the guest. Using the framework to avoid the relative expensive call guest_cpuid_has() during cr3 and vmexit handling paths for LAM. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28KVM: x86: Virtualize LAM for user pointerRobert Hoo
Add support to allow guests to set the new CR3 control bits for Linear Address Masking (LAM) and add implementation to get untagged address for user pointers. LAM modifies the canonical check for 64-bit linear addresses, allowing software to use the masked/ignored address bits for metadata. Hardware masks off the metadata bits before using the linear addresses to access memory. LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for virtualization. When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of the two LAM control bits. However, when EPT is off, the actual CR3 used by the guest is generated from the shadow MMU root which is different from the CR3 that is *set* by the guest, and KVM needs to manually apply any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest. KVM manually checks guest's CR3 to make sure it points to a valid guest physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend this check to allow the two LAM control bits to be set. After check, LAM bits of guest CR3 will be stripped off to extract guest physical address. In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address. Extend such check to allow the new LAM control bits too. Note, LAM doesn't have a global control bit to turn on/off LAM completely, but purely depends on hardware's CPUID to determine it can be enabled or not. That means, when EPT is on, even when KVM doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o causing problem. This is an unfortunate virtualization hole. KVM could choose to intercept CR3 in this case and inject fault but this would hurt performance when running a normal VM w/o LAM support. This is undesirable. Just choose to let the guest do such illegal thing as the worst case is guest being killed when KVM eventually find out such illegal behaviour and that the guest is misbehaving. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28KVM: x86: Virtualize LAM for supervisor pointerRobert Hoo
Add support to allow guests to set the new CR4 control bit for LAM and add implementation to get untagged address for supervisor pointers. LAM modifies the canonicality check applied to 64-bit linear addresses for data accesses, allowing software to use of the untranslated address bits for metadata and masks the metadata bits before using them as linear addresses to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM for supervisor pointers. It also changes VMENTER to allow the bit to be set in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP is allowed to be set even not in 64-bit mode, but it will not take effect since LAM only applies to 64-bit linear addresses. Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu supporting LAM or not. Leave it intercepted to prevent guest from setting the bit if LAM is not exposed to guest as well as to avoid vmread every time when KVM fetches its value, with the expectation that guest won't toggle the bit frequently. Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to allow guests to enable LAM for supervisor pointers in nested VMX operation. Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM doesn't need to emulate TLB flush based on it. There's no other features or vmx_exec_controls connection, and no other code needed in {kvm,vmx}_set_cr4(). Skip address untag for instruction fetches (which includes branch targets), operand of INVLPG instructions, and implicit system accesses, all of which are not subject to untagging. Note, get_untagged_addr() isn't invoked for implicit system accesses as there is no reason to do so, but check the flag anyways for documentation purposes. Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28KVM: x86: Untag addresses for LAM emulation where applicableBinbin Wu
Stub in vmx_get_untagged_addr() and wire up calls from the emulator (via get_untagged_addr()) and "direct" calls from various VM-Exit handlers in VMX where LAM untagging is supposed to be applied. Defer implementing the guts of vmx_get_untagged_addr() to future patches purely to make the changes easier to consume. LAM is active only for 64-bit linear addresses and several types of accesses are exempted. - Cases need to untag address (handled in get_vmx_mem_address()) Operand(s) of VMX instructions and INVPCID. Operand(s) of SGX ENCLS. - Cases LAM doesn't apply to (no change needed) Operand of INVLPG. Linear address in INVPCID descriptor. Linear address in INVVPID descriptor. BASEADDR specified in SECS of ECREATE. Note: - LAM doesn't apply to write to control registers or MSRs - LAM masking is applied before walking page tables, i.e. the faulting linear address in CR2 doesn't contain the metadata. - The guest linear address saved in VMCS doesn't contain metadata. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-10-binbin.wu@linux.intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28KVM: x86: Remove kvm_vcpu_is_illegal_gpa()Binbin Wu
Remove kvm_vcpu_is_illegal_gpa() and use !kvm_vcpu_is_legal_gpa() instead. The "illegal" helper actually predates the "legal" helper, the only reason the "illegal" variant wasn't removed by commit 4bda0e97868a ("KVM: x86: Add a helper to check for a legal GPA") was to avoid code churn. Now that CR3 has a dedicated helper, there are fewer callers, and so the code churn isn't that much of a deterrent. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-8-binbin.wu@linux.intel.com [sean: provide a bit of history in the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legalityBinbin Wu
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide a clear distinction between CR3 and GPA checks. This will allow exempting bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks, e.g. for upcoming features that will use high bits in CR3 for feature enabling. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-13KVM: Use gfn instead of hva for mmu_notifier_retryChao Peng
Currently in mmu_notifier invalidate path, hva range is recorded and then checked against by mmu_invalidate_retry_hva() in the page fault handling path. However, for the soon-to-be-introduced private memory, a page fault may not have a hva associated, checking gfn(gpa) makes more sense. For existing hva based shared memory, gfn is expected to also work. The only downside is when aliasing multiple gfns to a single hva, the current algorithm of checking multiple ranges could result in a much larger range being rejected. Such aliasing should be uncommon, so the impact is expected small. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Message-Id: <20231027182217.3615211-4-seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-31Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.7: - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while running an SEV-ES guest. - Clean up handling "failures" when KVM detects it can't emulate the "skip" action for an instruction that has already been partially emulated. Drop a hack in the SVM code that was fudging around the emulator code not giving SVM enough information to do the right thing.
2023-10-31Merge tag 'kvm-x86-mmu-6.7' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.7: - Clean up code that deals with honoring guest MTRRs when the VM has non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled. - Zap EPT entries when non-coherent DMA assignment stops/start to prevent using stale entries with the wrong memtype. - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as there's zero reason to ignore guest PAT if the effective MTRR memtype is WB. This will also allow for future optimizations of handling guest MTRR updates for VMs with non-coherent DMA and the quirk enabled. - Harden the fast page fault path to guard against encountering an invalid root when walking SPTEs.
2023-10-31Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.7: - Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without forcing more common use cases to eat the extra memory overhead. - Add IBPB and SBPB virtualization support. - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of creating the original vCPU would cause KVM to try to synchronize the vCPU's TSC and thus clobber the correct TSC being set by userspace. - Compute guest wall clock using a single TSC read to avoid generating an inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads. - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain about a "Firmware Bug" if the bit isn't set for select F/M/S combos. - Don't apply side effects to Hyper-V's synthetic timer on writes from userspace to fix an issue where the auto-enable behavior can trigger spurious interrupts, i.e. do auto-enabling only for guest writes. - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log without PML enabled. - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate. - Use octal notation for file permissions through KVM x86. - Fix a handful of typo fixes and warts.
2023-10-31Merge tag 'kvm-x86-apic-6.7' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 APIC changes for 6.7: - Purge VMX's posted interrupt descriptor *before* loading APIC state when handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC state, i.e. can immediately pend an IRQ if the expiry is in the past. - Clear the ICR.BUSY bit when handling trap-like x2APIC writes. This avoids a WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
2023-10-17KVM: x86: Use octal for file permissionPeng Hao
Convert all module params to octal permissions to improve code readability and to make checkpatch happy: WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20231013113020.77523-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-10KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEAREDYan Zhao
For KVM_X86_QUIRK_CD_NW_CLEARED is on, remove the IPAT (ignore PAT) bit in EPT memory types when cache is disabled and non-coherent DMA are present. To correctly emulate CR0.CD=1, UC + IPAT are required as memtype in EPT. However, as with commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), WB + IPAT are now returned to workaround a BIOS issue that guest MTRRs are enabled too late. Without this workaround, a super slow guest boot-up is expected during the pre-guest-MTRR-enabled period due to UC as the effective memory type for all guest memory. Absent emulating CR0.CD=1 with UC, it makes no sense to set IPAT when KVM is honoring the guest memtype. Removing the IPAT bit in this patch allows effective memory type to honor PAT values as well, as WB is the weakest memtype. It means if a guest explicitly claims UC as the memtype in PAT, the effective memory is UC instead of previous WB. If, for some unknown reason, a guest meets a slow boot-up issue with the removal of IPAT, it's desired to fix the blamed PAT in the guest. Returning guest MTRR type as if CR0.CD=0 is also not preferred because KVMs ABI for the quirk also requires KVM to force WB memtype regardless of guest MTRRs to workaround the slow guest boot-up issue. In the future, honoring guest PAT will also allow KVM to more precisely zap SPTEs when the effective memtype changes. E.g. by not forcing WB when CR0.CD=1, instead of zapping SPTEs when guest MTRRs change, KVM can skip MTRR-induced zaps if CR0.CD=1 and zap SPTEs for non-WB MTRR ranges when CR0.CD is toggled (WB MTRR SPTEs can be kept because they're WB regardless of CR0.CD). The change of removing IPAT has been verified with normal boot-up time on old OVMF of commit c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7 as well, dated back to Apr 14 2015. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065326.20557-1-yan.y.zhao@intel.com [sean: massage changelog to apply patch without full series] Signed-off-by: Sean Christopherson <seanjc@google.com>