summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)Author
2023-01-26KVM: x86/pmu: Use separate array for defining "PMU MSRs to save"Sean Christopherson
Move all potential to-be-saved PMU MSRs into a separate array so that a future patch can easily omit all PMU MSRs from the list when the PMU is disabled. No functional change intended. Link: https://lore.kernel.org/r/20230124234905.3774678-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrsSean Christopherson
Add helpers to print unimplemented MSR accesses and condition all such prints on report_ignored_msrs, i.e. honor userspace's request to not print unimplemented MSRs. Even though vcpu_unimpl() is ratelimited, printing can still be problematic, e.g. if a print gets stalled when host userspace is writing MSRs during live migration, an effective stall can result in very noticeable disruption in the guest. E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU counters while the PMU was disabled in KVM. - 99.75% 0.00% [.] __ioctl - __ioctl - 99.74% entry_SYSCALL_64_after_hwframe do_syscall_64 sys_ioctl - do_vfs_ioctl - 92.48% kvm_vcpu_ioctl - kvm_arch_vcpu_ioctl - 85.12% kvm_set_msr_ignored_check svm_set_msr kvm_set_msr_common printk vprintk_func vprintk_default vprintk_emit console_unlock call_console_drivers univ8250_console_write serial8250_console_write uart_console_write Reported-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26KVM: x86/pmu: Cap kvm_pmu_cap.num_counters_gp at KVM's internal maxSean Christopherson
Limit kvm_pmu_cap.num_counters_gp during kvm_init_pmu_capability() based on the vendor PMU capabilities so that consuming num_counters_gp naturally does the right thing. This fixes a mostly theoretical bug where KVM could over-report its PMU support in KVM_GET_SUPPORTED_CPUID for leaf 0xA, e.g. if the number of counters reported by perf is greater than KVM's hardcoded internal limit. Incorporating input from the AMD PMU also avoids over-reporting MSRs to save when running on AMD. Link: https://lore.kernel.org/r/20230124234905.3774678-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-25KVM: x86: Propagate the AMD Automatic IBRS feature to the guestKim Phillips
Add the AMD Automatic IBRS feature bit to those being propagated to the guest, and enable the guest EFER bit. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-9-kim.phillips@amd.com
2023-01-24KVM: x86: Replace IS_ERR() with IS_ERR_VALUE()ye xingchen
Avoid type casts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Link: https://lore.kernel.org/r/202211161718436948912@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86/pmu: Introduce masked events to the pmu event filterAaron Lewis
When building a list of filter events, it can sometimes be a challenge to fit all the events needed to adequately restrict the guest into the limited space available in the pmu event filter. This stems from the fact that the pmu event filter requires each event (i.e. event select + unit mask) be listed, when the intention might be to restrict the event select all together, regardless of it's unit mask. Instead of increasing the number of filter events in the pmu event filter, add a new encoding that is able to do a more generalized match on the unit mask. Introduce masked events as another encoding the pmu event filter understands. Masked events has the fields: mask, match, and exclude. When filtering based on these events, the mask is applied to the guest's unit mask to see if it matches the match value (i.e. umask & mask == match). The exclude bit can then be used to exclude events from that match. E.g. for a given event select, if it's easier to say which unit mask values shouldn't be filtered, a masked event can be set up to match all possible unit mask values, then another masked event can be set up to match the unit mask values that shouldn't be filtered. Userspace can query to see if this feature exists by looking for the capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS. This feature is enabled by setting the flags field in the pmu event filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS. Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY(). It is an error to have a bit set outside the valid bits for a masked event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in such cases, including the high bits of the event select (35:32) if called on Intel. With these updates the filter matching code has been updated to match on a common event. Masked events were flexible enough to handle both event types, so they were used as the common event. This changes how guest events get filtered because regardless of the type of event used in the uAPI, they will be converted to masked events. Because of this there could be a slight performance hit because instead of matching the filter event with a lookup on event select + unit mask, it does a lookup on event select then walks the unit masks to find the match. This shouldn't be a big problem because I would expect the set of common event selects to be small, and if they aren't the set can likely be reduced by using masked events to generalize the unit mask. Using one type of event when filtering guest events allows for a common code path to be used. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if presentPaul Durrant
The scaling information in subleaf 1 should match the values set by KVM in the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info) which is shared with the guest, but is not directly available to the VMM. The offset values are not set since a TSC offset is already applied. The TSC frequency should also be set in sub-leaf 2. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: Replace cpu_dirty_logging_count with nr_memslots_dirty_loggingDavid Matlack
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging. Both fields count the number of memslots that have dirty-logging enabled, with the only difference being that cpu_dirty_logging_count is only incremented when using PML. So while nr_memslots_dirty_logging is not a direct replacement for cpu_dirty_logging_count, it can be combined with enable_pml to get the same information. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24Merge branch 'kvm-lapic-fix-and-cleanup' into HEADPaolo Bonzini
The first half or so patches fix semi-urgent, real-world relevant APICv and AVIC bugs. The second half fixes a variety of AVIC and optimized APIC map bugs where KVM doesn't play nice with various edge cases that are architecturally legal(ish), but are unlikely to occur in most real world scenarios Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Track required APICv inhibits with variable, not callbackSean Christopherson
Track the per-vendor required APICv inhibits with a variable instead of calling into vendor code every time KVM wants to query the set of required inhibits. The required inhibits are a property of the vendor's virtualization architecture, i.e. are 100% static. Using a variable allows the compiler to inline the check, e.g. generate a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking inhibits for performance reasons. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Inhibit APIC memslot if x2APIC and AVIC are enabledSean Christopherson
Free the APIC access page memslot if any vCPU enables x2APIC and SVM's AVIC is enabled to prevent accesses to the virtual APIC on vCPUs with x2APIC enabled. On AMD, if its "hybrid" mode is enabled (AVIC is enabled when x2APIC is enabled even without x2AVIC support), keeping the APIC access page memslot results in the guest being able to access the virtual APIC page as x2APIC is fully emulated by KVM. I.e. hardware isn't aware that the guest is operating in x2APIC mode. Exempt nested SVM's update of APICv state from the new logic as x2APIC can't be toggled on VM-Exit. In practice, invoking the x2APIC logic should be harmless precisely because it should be a glorified nop, but play it safe to avoid latent bugs, e.g. with dropping the vCPU's SRCU lock. Intel doesn't suffer from the same issue as APICv has fully independent VMCS controls for xAPIC vs. x2APIC virtualization. Technically, KVM should provide bus error semantics and not memory semantics for the APIC page when x2APIC is enabled, but KVM already provides memory semantics in other scenarios, e.g. if APICv/AVIC is enabled and the APIC is hardware disabled (via APIC_BASE MSR). Note, checking apic_access_memslot_enabled without taking locks relies it being set during vCPU creation (before kvm_vcpu_reset()). vCPUs can race to set the inhibit and delete the memslot, i.e. can get false positives, but can't get false negatives as apic_access_memslot_enabled can't be toggled "on" once any vCPU reaches KVM_RUN. Opportunistically drop the "can" while updating avic_activate_vmcb()'s comment, i.e. to state that KVM _does_ support the hybrid mode. Move the "Note:" down a line to conform to preferred kernel/KVM multi-line comment style. Opportunistically update the apicv_update_lock comment, as it isn't actually used to protect apic_access_memslot_enabled (which is protected by slots_lock). Fixes: 0e311d33bfbe ("KVM: SVM: Introduce hybrid-AVIC mode") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Disable CPU hotplug during hardware enabling/disablingChao Gao
Disable CPU hotplug when enabling/disabling hardware to prevent the corner case where if the following sequence occurs: 1. A hotplugged CPU marks itself online in cpu_online_mask 2. The hotplugged CPU enables interrupt before invoking KVM's ONLINE callback 3 hardware_{en,dis}able_all() is invoked on another CPU the hotplugged CPU will be included in on_each_cpu() and thus get sent through hardware_{en,dis}able_nolock() before kvm_online_cpu() is called. start_secondary { ... set_cpu_online(smp_processor_id(), true); <- 1 ... local_irq_enable(); <- 2 ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3 } KVM currently fudges around this race by keeping track of which CPUs have done hardware enabling (see commit 1b6c016818a5 "KVM: Keep track of which cpus have virtualization enabled"), but that's an inefficient, convoluted, and hacky solution. Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-43-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE sectionChao Gao
The CPU STARTING section doesn't allow callbacks to fail. Move KVM's hotplug callback to ONLINE section so that it can abort onlining a CPU in certain cases to avoid potentially breaking VMs running on existing CPUs. For example, when KVM fails to enable hardware virtualization on the hotplugged CPU. Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures when offlining a CPU, all user tasks and non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU task around. So, it is safe for KVM's CPU offline callback to disable hardware virtualization at that point. Likewise, KVM's online callback can enable hardware virtualization before any vCPU task gets a chance to run on hotplugged CPUs. Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are disabled, as the ONLINE section runs with IRQs disabled. The WARN wasn't intended to be a requirement, e.g. disabling preemption is sufficient, the IRQ thing was purely an aggressive sanity check since the helper was only ever invoked via SMP function call. Rename KVM's CPU hotplug callbacks accordingly. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> [sean: drop WARN that IRQs are disabled] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Do compatibility checks when onlining CPUChao Gao
Do compatibility checks when enabling hardware to effectively add compatibility checks when onlining a CPU. Abort enabling, i.e. the online process, if the (hotplugged) CPU is incompatible with the known good setup. At init time, KVM does compatibility checks to ensure that all online CPUs support hardware virtualization and a common set of features. But KVM uses hotplugged CPUs without such compatibility checks. On Intel CPUs, this leads to #GP if the hotplugged CPU doesn't support VMX, or VM-Entry failure if the hotplugged CPU doesn't support all features enabled by KVM. Note, this is little more than a NOP on SVM, as SVM already checks for full SVM support during hardware enabling. Opportunistically add a pr_err() if setup_vmcs_config() fails, and tweak all error messages to output which CPU failed. Signed-off-by: Chao Gao <chao.gao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-41-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Move CPU compat checks hook to kvm_x86_ops (from kvm_x86_init_ops)Sean Christopherson
Move the .check_processor_compatibility() callback from kvm_x86_init_ops to kvm_x86_ops to allow a future patch to do compatibility checks during CPU hotplug. Do kvm_ops_update() before compat checks so that static_call() can be used during compat checks. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-40-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Do VMX/SVM support checks directly in vendor codeSean Christopherson
Do basic VMX/SVM support checks directly in vendor code instead of implementing them via kvm_x86_ops hooks. Beyond the superficial benefit of providing common messages, which isn't even clearly a net positive since vendor code can provide more precise/detailed messages, there's zero advantage to bouncing through common x86 code. Consolidating the checks will also simplify performing the checks across all CPUs (in a future patch). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-37-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Unify pr_fmt to use module name for all KVM modulesSean Christopherson
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks use consistent formatting across common x86, Intel, and AMD code. In addition to providing consistent print formatting, using KBUILD_MODNAME, e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and SGX and ...) as technologies without generating weird messages, and without causing naming conflicts with other kernel code, e.g. "SEV: ", "tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems. Opportunistically move away from printk() for prints that need to be modified anyways, e.g. to drop a manual "kvm: " prefix. Opportunistically convert a few SGX WARNs that are similarly modified to WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good that they would fire repeatedly and spam the kernel log without providing unique information in each print. Note, defining pr_fmt yields undesirable results for code that uses KVM's printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's wrappers is relatively limited in KVM x86 code. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-Id: <20221130230934.1014142-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop kvm_arch_check_processor_compat() hookSean Christopherson
Drop kvm_arch_check_processor_compat() and its support code now that all architecture implementations are nops. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by: Anup Patel <anup@brainfault.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Do CPU compatibility checks in x86 codeSean Christopherson
Move the CPU compatibility checks to pure x86 code, i.e. drop x86's use of the common kvm_x86_check_cpu_compat() arch hook. x86 is the only architecture that "needs" to do per-CPU compatibility checks, moving the logic to x86 will allow dropping the common code, and will also give x86 more control over when/how the compatibility checks are performed, e.g. TDX will need to enable hardware (do VMXON) in order to perform compatibility checks. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop kvm_arch_{init,exit}() hooksSean Christopherson
Drop kvm_arch_init() and kvm_arch_exit() now that all implementations are nops. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Serialize vendor module initialization (hardware setup)Sean Christopherson
Acquire a new mutex, vendor_module_lock, in kvm_x86_vendor_init() while doing hardware setup to ensure that concurrent calls are fully serialized. KVM rejects attempts to load vendor modules if a different module has already been loaded, but doesn't handle the case where multiple vendor modules are loaded at the same time, and module_init() doesn't run under the global module_mutex. Note, in practice, this is likely a benign bug as no platform exists that supports both SVM and VMX, i.e. barring a weird VM setup, one of the vendor modules is guaranteed to fail a support check before modifying common KVM state. Alternatively, KVM could perform an atomic CMPXCHG on .hardware_enable, but that comes with its own ugliness as it would require setting .hardware_enable before success is guaranteed, e.g. attempting to load the "wrong" could result in spurious failure to load the "right" module. Introduce a new mutex as using kvm_lock is extremely deadlock prone due to kvm_lock being taken under cpus_write_lock(), and in the future, under under cpus_read_lock(). Any operation that takes cpus_read_lock() while holding kvm_lock would potentially deadlock, e.g. kvm_timer_init() takes cpus_read_lock() to register a callback. In theory, KVM could avoid such problematic paths, i.e. do less setup under kvm_lock, but avoiding all calls to cpus_read_lock() is subtly difficult and thus fragile. E.g. updating static calls also acquires cpus_read_lock(). Inverting the lock ordering, i.e. always taking kvm_lock outside cpus_read_lock(), is not a viable option as kvm_lock is taken in various callbacks that may be invoked under cpus_read_lock(), e.g. x86's kvmclock_cpufreq_notifier(). The lockdep splat below is dependent on future patches to take cpus_read_lock() in hardware_enable_all(), but as above, deadlock is already is already possible. ====================================================== WARNING: possible circular locking dependency detected 6.0.0-smp--7ec93244f194-init2 #27 Tainted: G O ------------------------------------------------------ stable/251833 is trying to acquire lock: ffffffffc097ea28 (kvm_lock){+.+.}-{3:3}, at: hardware_enable_all+0x1f/0xc0 [kvm] but task is already holding lock: ffffffffa2456828 (cpu_hotplug_lock){++++}-{0:0}, at: hardware_enable_all+0xf/0xc0 [kvm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x2a/0xa0 __cpuhp_setup_state+0x2b/0x60 __kvm_x86_vendor_init+0x16a/0x1870 [kvm] kvm_x86_vendor_init+0x23/0x40 [kvm] 0xffffffffc0a4d02b do_one_initcall+0x110/0x200 do_init_module+0x4f/0x250 load_module+0x1730/0x18f0 __se_sys_finit_module+0xca/0x100 __x64_sys_finit_module+0x1d/0x20 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (kvm_lock){+.+.}-{3:3}: __lock_acquire+0x16f4/0x30d0 lock_acquire+0xb2/0x190 __mutex_lock+0x98/0x6f0 mutex_lock_nested+0x1b/0x20 hardware_enable_all+0x1f/0xc0 [kvm] kvm_dev_ioctl+0x45e/0x930 [kvm] __se_sys_ioctl+0x77/0xc0 __x64_sys_ioctl+0x1d/0x20 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug_lock); lock(kvm_lock); lock(cpu_hotplug_lock); lock(kvm_lock); *** DEADLOCK *** 1 lock held by stable/251833: #0: ffffffffa2456828 (cpu_hotplug_lock){++++}-{0:0}, at: hardware_enable_all+0xf/0xc0 [kvm] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Move guts of kvm_arch_init() to standalone helperSean Christopherson
Move the guts of kvm_arch_init() to a new helper, kvm_x86_vendor_init(), so that VMX can do _all_ arch and vendor initialization before calling kvm_init(). Calling kvm_init() must be the _very_ last step during init, as kvm_init() exposes /dev/kvm to userspace, i.e. allows creating VMs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop arch hardware (un)setup hooksSean Christopherson
Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that all implementations are nops. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by: Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Move hardware setup/unsetup to init/exitSean Christopherson
Now that kvm_arch_hardware_setup() is called immediately after kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into kvm_arch_{init,exit}() as a step towards dropping one of the hooks. To avoid having to unwind various setup, e.g registration of several notifiers, slot in the vendor hardware setup before the registration of said notifiers and callbacks. Introducing a functional change while moving code is less than ideal, but the alternative is adding a pile of unwinding code, which is much more error prone, e.g. several attempts to move the setup code verbatim all introduced bugs. Add a comment to document that kvm_ops_update() is effectively the point of no return, e.g. it sets the kvm_x86_ops.hardware_enable canary and so needs to be unwound. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Do timer initialization after XCR0 configurationSean Christopherson
Move kvm_arch_init()'s call to kvm_timer_init() down a few lines below the XCR0 configuration code. A future patch will move hardware setup into kvm_arch_init() and slot in vendor hardware setup before the call to kvm_timer_init() so that timer initialization (among other stuff) doesn't need to be unwound if vendor setup fails. XCR0 setup on the other hand needs to happen before vendor hardware setup. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29Merge branch 'kvm-late-6.1' into HEADPaolo Bonzini
x86: * Change tdp_mmu to a read-only parameter * Separate TDP and shadow MMU page fault paths * Enable Hyper-V invariant TSC control selftests: * Use TAP interface for kvm_binary_stats_test and tsc_msrs_test Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86: Hyper-V invariant TSC controlVitaly Kuznetsov
Normally, genuine Hyper-V doesn't expose architectural invariant TSC (CPUID.80000007H:EDX[8]) to its guests by default. A special PV MSR (HV_X64_MSR_TSC_INVARIANT_CONTROL, 0x40000118) and corresponding CPUID feature bit (CPUID.0x40000003.EAX[15]) were introduced. When bit 0 of the PV MSR is set, invariant TSC bit starts to show up in CPUID. When the feature is exposed to Hyper-V guests, reenlightenment becomes unneeded. Add the feature to KVM. Keep CPUID output intact when the feature wasn't exposed to L1 and implement the required logic for hiding invariant TSC when the feature was exposed and invariant TSC control MSR wasn't written to. Copy genuine Hyper-V behavior and forbid to disable the feature once it was enabled. For the reference, for linux guests, support for the feature was added in commit dce7cd62754b ("x86/hyperv: Allow guests to enable InvariantTSC"). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013095849.705943-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-28Merge branch 'kvm-late-6.1-fixes' into HEADPaolo Bonzini
x86: * several fixes to nested VMX execution controls * fixes and clarification to the documentation for Xen emulation * do not unnecessarily release a pmu event with zero period * MMU fixes * fix Coverity warning in kvm_hv_flush_tlb() selftests: * fixes for the ucall mechanism in selftests * other fixes mostly related to compilation with clang
2022-12-28KVM: x86: Hyper-V invariant TSC controlVitaly Kuznetsov
Normally, genuine Hyper-V doesn't expose architectural invariant TSC (CPUID.80000007H:EDX[8]) to its guests by default. A special PV MSR (HV_X64_MSR_TSC_INVARIANT_CONTROL, 0x40000118) and corresponding CPUID feature bit (CPUID.0x40000003.EAX[15]) were introduced. When bit 0 of the PV MSR is set, invariant TSC bit starts to show up in CPUID. When the feature is exposed to Hyper-V guests, reenlightenment becomes unneeded. Add the feature to KVM. Keep CPUID output intact when the feature wasn't exposed to L1 and implement the required logic for hiding invariant TSC when the feature was exposed and invariant TSC control MSR wasn't written to. Copy genuine Hyper-V behavior and forbid to disable the feature once it was enabled. For the reference, for linux guests, support for the feature was added in commit dce7cd62754b ("x86/hyperv: Allow guests to enable InvariantTSC"). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013095849.705943-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: Sanity check inputs to kvm_handle_memory_failure()Sean Christopherson
Add a sanity check in kvm_handle_memory_failure() to assert that a valid x86_exception structure is provided if the memory "failure" wants to propagate a fault into the guest. If a memory failure happens during a direct guest physical memory access, e.g. for nested VMX, KVM hardcodes the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer (because the exception struct would just be filled with garbage). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221220153427.514032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of a unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...
2022-12-12Merge remote-tracking branch 'kvm/queue' into HEADPaolo Bonzini
x86 Xen-for-KVM: * Allow the Xen runstate information to cross a page boundary * Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured * add support for 32-bit guests in SCHEDOP_poll x86 fixes: * One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). * Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. * Clean up the MSR filter docs. * Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. * Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. * Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. * Advertise (on AMD) that the SMM_CTL MSR is not supported * Remove unnecessary exports Selftests: * Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. * Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). * Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. * Introduce actual atomics for clear/set_bit() in selftests Documentation: * Remove deleted ioctls from documentation * Various fixes
2022-12-09Merge tag 'kvmarm-6.2' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.2 - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on. - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints, stage-2 faults and access tracking. You name it, we got it, we probably broke it. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. As a side effect, this tag also drags: - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring series - A shared branch with the arm64 tree that repaints all the system registers to match the ARM ARM's naming, and resulting in interesting conflicts
2022-12-05Merge branch kvm-arm64/dirty-ring into kvmarm-master/nextMarc Zyngier
* kvm-arm64/dirty-ring: : . : Add support for the "per-vcpu dirty-ring tracking with a bitmap : and sprinkles on top", courtesy of Gavin Shan. : : This branch drags the kvmarm-fixes-6.1-3 tag which was already : merged in 6.1-rc4 so that the branch is in a working state. : . KVM: Push dirty information unconditionally to backup bitmap KVM: selftests: Automate choosing dirty ring size in dirty_log_test KVM: selftests: Clear dirty ring states between two modes in dirty_log_test KVM: selftests: Use host page size to map ring buffer in dirty_log_test KVM: arm64: Enable ring-based dirty memory tracking KVM: Support dirty ring in conjunction with bitmap KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-12-02Merge branch 'gpc-fixes' of git://git.infradead.org/users/dwmw2/linux into HEADPaolo Bonzini
Pull Xen-for-KVM changes from David Woodhouse: * add support for 32-bit guests in SCHEDOP_poll * the rest of the gfn-to-pfn cache API cleanup "I still haven't reinstated the last of those patches to make gpc->len immutable." Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-02KVM: x86: remove unnecessary exportsPaolo Bonzini
Several symbols are not used by vendor modules but still exported. Removing them ensures that new coupling between kvm.ko and kvm-*.ko is noticed and reviewed. Co-developed-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Like Xu <like.xu.linux@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <like.xu.linux@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86: Use current rather than snapshotted TSC frequency if it is constantAnton Romanov
Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is constant, in which case the actual TSC frequency will never change and thus capturing TSC during initialization is unnecessary, KVM can simply use tsc_khz. This value is snapshotted from kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL) On CPUs with constant TSC, but not a hardware-specified TSC frequency, snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency can lead to VM to think its TSC frequency is not what it actually is if refining the TSC completes after KVM snapshots tsc_khz. The actual frequency never changes, only the kernel's calculation of what that frequency is changes. Ideally, KVM would not be able to race with TSC refinement, or would have a hook into tsc_refine_calibration_work() to get an alert when refinement is complete. Avoiding the race altogether isn't practical as refinement takes a relative eternity; it's deliberately put on a work queue outside of the normal boot sequence to avoid unnecessarily delaying boot. Adding a hook is doable, but somewhat gross due to KVM's ability to be built as a module. And if the TSC is constant, which is likely the case for every VMX/SVM-capable CPU produced in the last decade, the race can be hit if and only if userspace is able to create a VM before TSC refinement completes; refinement is slow, but not that slow. For now, punt on a proper fix, as not taking a snapshot can help some uses cases and not taking a snapshot is arguably correct irrespective of the race with refinement. Signed-off-by: Anton Romanov <romanton@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30KVM: x86: Fail emulation during EMULTYPE_SKIP on any exceptionSean Christopherson
Treat any exception during instruction decode for EMULTYPE_SKIP as a "full" emulation failure, i.e. signal failure instead of queuing the exception. When decoding purely to skip an instruction, KVM and/or the CPU has already done some amount of emulation that cannot be unwound, e.g. on an EPT misconfig VM-Exit KVM has already processeed the emulated MMIO. KVM already does this if a #UD is encountered, but not for other exceptions, e.g. if a #PF is encountered during fetch. In SVM's soft-injection use case, queueing the exception is particularly problematic as queueing exceptions while injecting events can put KVM into an infinite loop due to bailing from VM-Enter to service the newly pending exception. E.g. multiple warnings to detect such behavior fire: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G W 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com
2022-11-30KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpersSean Christopherson
Drop the @gpa param from the exported check()+refresh() helpers and limit changing the cache's GPA to the activate path. All external users just feed in gpc->gpa, i.e. this is a fancy nop. Allowing users to change the GPA at check()+refresh() is dangerous as those helpers explicitly allow concurrent calls, e.g. KVM could get into a livelock scenario. It's also unclear as to what the expected behavior should be if multiple tasks attempt to refresh with different GPAs. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()Michal Luczaj
Make kvm_gpc_refresh() use kvm instance cached in gfn_to_pfn_cache. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: leave kvm_gpc_unmap() as-is] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()Michal Luczaj
Make kvm_gpc_check() use kvm instance cached in gfn_to_pfn_cache. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Store immutable gfn_to_pfn_cache propertiesMichal Luczaj
Move the assignment of immutable properties @kvm, @vcpu, and @usage to the initializer. Make _activate() and _deactivate() use stored values. Note, @len is also effectively immutable for most cases, but not in the case of the Xen runstate cache, which may be split across two pages and the length of the first segment will depend on its address. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: handle @len in a separate patch] Signed-off-by: Sean Christopherson <seanjc@google.com> [dwmw2: acknowledge that @len can actually change for some use cases] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULTPaolo Bonzini
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by turning it into a vmexit), there is no need to leave vcpu_enter_guest(). Any vcpu->requests will be caught later before the actual vmentry, and in fact vcpu_enter_guest() was not initializing the "r" variable. Depending on the compiler's whims, this could cause the x86_64/triple_fault_event_test test to fail. Cc: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 92e7d5c83aff ("KVM: x86: allow L1 to not intercept triple fault") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULTPaolo Bonzini
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by turning it into a vmexit), there is no need to leave vcpu_enter_guest(). Any vcpu->requests will be caught later before the actual vmentry, and in fact vcpu_enter_guest() was not initializing the "r" variable. Depending on the compiler's whims, this could cause the x86_64/triple_fault_event_test test to fail. Cc: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 92e7d5c83aff ("KVM: x86: allow L1 to not intercept triple fault") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: Shorten gfn_to_pfn_cache function namesMichal Luczaj
Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configuredDavid Woodhouse
Closer inspection of the Xen code shows that we aren't supposed to be using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall. If we randomly set the top bit of ->state_entry_time for a guest that hasn't asked for it and doesn't expect it, that could make the runtimes fail to add up and confuse the guest. Without the flag it's perfectly safe for a vCPU to read its own vcpu_runstate_info; just not for one vCPU to read *another's*. I briefly pondered adding a word for the whole set of VMASST_TYPE_* flags but the only one we care about for HVM guests is this, so it seemed a bit pointless. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221127122210.248427-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-27Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "x86: - Fixes for Xen emulation. While nobody should be enabling it in the kernel (the only public users of the feature are the selftests), the bug effectively allows userspace to read arbitrary memory. - Correctness fixes for nested hypervisors that do not intercept INIT or SHUTDOWN on AMD; the subsequent CPU reset can cause a use-after-free when it disables virtualization extensions. While downgrading the panic to a WARN is quite easy, the full fix is a bit more laborious; there are also tests. This is the bulk of the pull request. - Fix race condition due to incorrect mmu_lock use around make_mmu_pages_available(). Generic: - Obey changes to the kvm.halt_poll_ns module parameter in VMs not using KVM_CAP_HALT_POLL, restoring behavior from before the introduction of the capability" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Update gfn_to_pfn_cache khva when it moves within the same page KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0 KVM: x86/xen: Validate port number in SCHEDOP_poll KVM: x86/mmu: Fix race condition in direct_page_fault KVM: x86: remove exit_int_info warning in svm_handle_exit KVM: selftests: add svm part to triple_fault_test KVM: x86: allow L1 to not intercept triple fault kvm: selftests: add svm nested shutdown test KVM: selftests: move idt_entry to header KVM: x86: forcibly leave nested mode on vCPU reset KVM: x86: add kvm_leave_nested KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use KVM: x86: nSVM: leave nested mode on vCPU free KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling KVM: Cap vcpu->halt_poll_ns before halting rather than after
2022-11-18KVM: x86: hyper-v: Introduce TLB flush fifoVitaly Kuznetsov
To allow flushing individual GVAs instead of always flushing the whole VPID a per-vCPU structure to pass the requests is needed. Use standard 'kfifo' to queue two types of entries: individual GVA (GFN + up to 4095 following GFNs in the lower 12 bits) and 'flush all'. The size of the fifo is arbitrarily set to '16'. Note, kvm_hv_flush_tlb() only queues 'flush all' entries for now and kvm_hv_vcpu_flush_tlb() doesn't actually read the fifo just resets the queue before returning -EOPNOTSUPP (which triggers full TLB flush) so the functional change is very small but the infrastructure is prepared to handle individual GVA flush requests. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-10-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18KVM: x86: hyper-v: Resurrect dedicated KVM_REQ_HV_TLB_FLUSH flagVitaly Kuznetsov
In preparation to implementing fine-grained Hyper-V TLB flush and L2 TLB flush, resurrect dedicated KVM_REQ_HV_TLB_FLUSH request bit. As KVM_REQ_TLB_FLUSH_GUEST is a stronger operation, clear KVM_REQ_HV_TLB_FLUSH request in kvm_vcpu_flush_tlb_guest(). The flush itself is temporary handled by kvm_vcpu_flush_tlb_guest(). No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18KVM: x86: Move clearing of TLB_FLUSH_CURRENT to kvm_vcpu_flush_tlb_all()Sean Christopherson
Clear KVM_REQ_TLB_FLUSH_CURRENT in kvm_vcpu_flush_tlb_all() instead of in its sole caller that processes KVM_REQ_TLB_FLUSH. Regardless of why/when kvm_vcpu_flush_tlb_all() is called, flushing "all" TLB entries also flushes "current" TLB entries. Ideally, there will never be another caller of kvm_vcpu_flush_tlb_all(), and moving the handling "requires" extra work to document the ordering requirement, but future Hyper-V paravirt TLB flushing support will add similar logic for flush "guest" (Hyper-V can flush a subset of "guest" entries). And in the Hyper-V case, KVM needs to do more than just clear the request, the queue of GPAs to flush also needs to purged, and doing all only in the request path is undesirable as kvm_vcpu_flush_tlb_guest() does have multiple callers (though it's unlikely KVM's paravirt TLB flush will coincide with Hyper-V's paravirt TLB flush). Move the logic even though it adds extra "work" so that KVM will be consistent with how flush requests are processed when the Hyper-V support lands. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>