summaryrefslogtreecommitdiff
path: root/Documentation/virt
AgeCommit message (Collapse)Author
2024-09-12user_mode_linux_howto_v2: add VDE vector support in docRenzo Davoli
Add a description of the VDE vector transport in user_mode_linux_howto_v2.rst. Signed-off-by: Renzo Davoli <renzo@cs.unibo.it> Signed-off-by: Renzo Davoi <renzo@cs.unibo.it> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-09-09Merge tag 'hyperv-fixes-signed-20240908' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Add a documentation overview of Confidential Computing VM support (Michael Kelley) - Use lapic timer in a TDX VM without paravisor (Dexuan Cui) - Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency (Michael Kelley) - Fix a kexec crash due to VP assist page corruption (Anirudh Rayabharam) - Python3 compatibility fix for lsvmbus (Anthony Nandaa) - Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li) * tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hv: vmbus: Constify struct kobj_type and struct attribute_group tools: hv: rm .*.cmd when make clean x86/hyperv: fix kexec crash due to VP assist page corruption Drivers: hv: vmbus: Fix the misplaced function description tools: hv: lsvmbus: change shebang to use python3 x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency Documentation: hyperv: Add overview of Confidential Computing VM support clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor Drivers: hv: Remove deprecated hv_fcopy declarations
2024-09-05Loongarch: KVM: Add KVM hypercalls documentation for LoongArchBibo Mao
Add documentation topic for using pv_virt when running as a guest on KVM hypervisor. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Xianglai Li <lixianglai@loongson.cn> Co-developed-by: Mingcong Bai <jeffbai@aosc.io> Signed-off-by: Mingcong Bai <jeffbai@aosc.io> Link: https://lore.kernel.org/all/5c338084b1bcccc1d57dce9ddb1e7081@aosc.io/ Signed-off-by: Dandan Zhang <zhangdandan@uniontech.com> [jc: fixed htmldocs build error] Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/4769C036576F8816+20240828045950.3484113-1-zhangdandan@uniontech.com
2024-09-04KVM: Register cpuhp and syscore callbacks when enabling hardwareSean Christopherson
Register KVM's cpuhp and syscore callback when enabling virtualization in hardware instead of registering the callbacks during initialization, and let the CPU up/down framework invoke the inner enable/disable functions. Registering the callbacks during initialization makes things more complex than they need to be, as KVM needs to be very careful about handling races between enabling CPUs being onlined/offlined and hardware being enabled/disabled. Intel TDX support will require KVM to enable virtualization during KVM initialization, i.e. will add another wrinkle to things, at which point sorting out the potential races with kvm_usage_count would become even more complex. Note, using the cpuhp framework has a subtle behavioral change: enabling will be done serially across all CPUs, whereas KVM currently sends an IPI to all CPUs in parallel. While serializing virtualization enabling could create undesirable latency, the issue is limited to the 0=>1 transition of VM creation. And even that can be mitigated, e.g. by letting userspace force virtualization to be enabled when KVM is initialized. Cc: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlockSean Christopherson
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock on x86 due to a chain of locks and SRCU synchronizations. Translating the below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the fairness of r/w semaphores). CPU0 CPU1 CPU2 1 lock(&kvm->slots_lock); 2 lock(&vcpu->mutex); 3 lock(&kvm->srcu); 4 lock(cpu_hotplug_lock); 5 lock(kvm_lock); 6 lock(&kvm->slots_lock); 7 lock(cpu_hotplug_lock); 8 sync(&kvm->srcu); Note, there are likely more potential deadlocks in KVM x86, e.g. the same pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with __kvmclock_cpufreq_notifier(): cpuhp_cpufreq_online() | -> cpufreq_online() | -> cpufreq_gov_performance_limits() | -> __cpufreq_driver_target() | -> __target_index() | -> cpufreq_freq_transition_begin() | -> cpufreq_notify_transition() | -> ... __kvmclock_cpufreq_notifier() But, actually triggering such deadlocks is beyond rare due to the combination of dependencies and timings involved. E.g. the cpufreq notifier is only used on older CPUs without a constant TSC, mucking with the NX hugepage mitigation while VMs are running is very uncommon, and doing so while also onlining/offlining a CPU (necessary to generate contention on cpu_hotplug_lock) would be even more unusual. The most robust solution to the general cpu_hotplug_lock issue is likely to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq notifier doesn't to take kvm_lock. For now, settle for fixing the most blatant deadlock, as switching to an RCU-protected list is a much more involved change, but add a comment in locking.rst to call out that care needs to be taken when walking holding kvm_lock and walking vm_list. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S O ------------------------------------------------------ tee/35048 is trying to acquire lock: ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm] but task is already holding lock: ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (kvm_lock){+.+.}-{3:3}: __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 kvm_dev_ioctl+0x4fb/0xe50 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x2e/0xb0 static_key_slow_inc+0x16/0x30 kvm_lapic_set_base+0x6a/0x1c0 [kvm] kvm_set_apic_base+0x8f/0xe0 [kvm] kvm_set_msr_common+0x9ae/0xf80 [kvm] vmx_set_msr+0xa54/0xbe0 [kvm_intel] __kvm_set_msr+0xb6/0x1a0 [kvm] kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm] kvm_vcpu_ioctl+0x485/0x5b0 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&kvm->srcu){.+.+}-{0:0}: __synchronize_srcu+0x44/0x1a0 synchronize_srcu_expedited+0x21/0x30 kvm_swap_active_memslots+0x110/0x1c0 [kvm] kvm_set_memslot+0x360/0x620 [kvm] __kvm_set_memory_region+0x27b/0x300 [kvm] kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm] kvm_vm_ioctl+0x295/0x650 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&kvm->slots_lock){+.+.}-{3:3}: __lock_acquire+0x15ef/0x2e30 lock_acquire+0xe0/0x260 __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 set_nx_huge_pages+0x179/0x1e0 [kvm] param_attr_store+0x93/0x100 module_attr_store+0x22/0x40 sysfs_kf_write+0x81/0xb0 kernfs_fop_write_iter+0x133/0x1d0 vfs_write+0x28d/0x380 ksys_write+0x70/0xe0 __x64_sys_write+0x1f/0x30 x64_sys_call+0x281b/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: Chao Gao <chao.gao@intel.com> Fixes: 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock") Cc: stable@vger.kernel.org Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-30drivers/virt: pkvm: Intercept ioremap using pKVM MMIO_GUARD hypercallWill Deacon
Hook up pKVM's MMIO_GUARD hypercall so that ioremap() and friends will register the target physical address as MMIO with the hypervisor, allowing guest exits to that page to be emulated by the host with full syndrome information. Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240830130150.8568-7-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30drivers/virt: pkvm: Hook up mem_encrypt API using pKVM hypercallsWill Deacon
If we detect the presence of pKVM's SHARE and UNSHARE hypercalls, then register a backend implementation of the mem_encrypt API so that things like DMA buffers can be shared appropriately with the host. Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240830130150.8568-5-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30drivers/virt: pkvm: Add initial support for running as a protected guestWill Deacon
Implement a pKVM protected guest driver to probe the presence of pKVM and determine the memory protection granule using the HYP_MEMINFO hypercall. Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240830130150.8568-3-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2024-08-22KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-ExitSean Christopherson
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent. More importantly, the behavior is well-defined _and_ can be communicated the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date. Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace *really* wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access. This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr"). Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-14KVM: x86/mmu: Introduce a quirk to control memslot zap behaviorYan Zhao
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM VMs. Make sure KVM behave as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs. The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options: - when enabled: Invalidate/zap all SPTEs ("zap-all"), - when disabled: Precisely zap only the leaf SPTEs within the range of the moving/deleting memory slot ("zap-slot-leafs-only"). "zap-all" is today's KVM behavior to work around a bug [1] where the changing the zapping behavior of memslot move/deletion would cause VM instability for VMs with an Nvidia GPU assigned; while "zap-slot-leafs-only" allows for more precise zapping of SPTEs within the memory slot range, improving performance in certain scenarios [2], and meeting the functional requirements for TDX. Previous attempts to select "zap-slot-leafs-only" include a per-VM capability approach [3] (which was not preferred because the root cause of the bug remained unidentified) and a per-memslot flag approach [4]. Sean and Paolo finally recommended the implementation of this quirk and explained that it's the least bad option [5]. By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use "zap-all". Users have the option to disable the quirk to select "zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are unaffected by this bug. For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is always selected without user's opt-in, regardless of if the user opts for "zap-all". This is because it is assumed until proven otherwise that non- KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most importantly, it's because TDX must have "zap-slot-leafs-only" always selected. In TDX's case a memslot's GPA range can be a mixture of "private" or "shared" memory. Shared is roughly analogous to how EPT is handled for normal VMs, but private GPAs need lots of special treatment: 1) "zap-all" would require to zap private root page or non-leaf entries or at least leaf-entries beyond the deleting memslot scope. However, TDX demands that the root page of the private page table remains unchanged, with leaf entries being zapped before non-leaf entries, and any dropped private guest pages must be re-accepted by the guest. 2) if "zap-all" zaps only shared page tables, it would result in private pages still being mapped when the memslot is gone. This may affect even other processes if later the gmem fd was whole punched, causing the pages being freed on the host while still mapped in the TD, because there's no pgoff to the gfn information to zap the private page table after memslot is gone. So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or complicating quirk disabling interface (current quirk disabling interface is limited, no way to query quirks, or force them to be disabled). Add a new function kvm_mmu_zap_memslot_leafs() to implement "zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(), bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as 1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can it be moved. It is only deleted by KVM when APICv is permanently inhibited. 2) kvm_vcpu_reload_apic_access_page() effectively does nothing when APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted. 3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on costly IPIs. Suggested-by: Kai Huang <kai.huang@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1] Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3] Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4] Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5] Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6] Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-13Merge tag 'kvmarm-fixes-6.11-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.11, round #1 - Use kvfree() for the kvmalloc'd nested MMUs array - Set of fixes to address warnings in W=1 builds - Make KVM depend on assembler support for ARMv8.4 - Fix for vgic-debug interface for VMs without LPIs - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest - Minor code / comment cleanups for configuring PAuth traps - Take kvm->arch.config_lock to prevent destruction / initialization race for a vCPU's CPUIF which may lead to a UAF
2024-08-05docs: KVM: Fix register ID of SPSR_FIQTakahiro Itazuri
Fixes the register ID of SPSR_FIQ. SPSR_FIQ is a 64-bit register and the 64-bit register size mask is 0x0030000000000000ULL. Fixes: fd3bc912d3d1 ("KVM: Documentation: Document arm64 core registers in detail") Signed-off-by: Takahiro Itazuri <itazur@amazon.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230606154628.95498-1-itazur@amazon.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-08-02Merge branch 'kvm-fixes' into HEADPaolo Bonzini
* fix latent bug in how usage of large pages is determined for confidential VMs * fix "underline too short" in docs * eliminate log spam from limited APIC timer periods * disallow pre-faulting of memory before SEV-SNP VMs are initialized * delay clearing and encrypting private memory until it is added to guest page tables * this change also enables another small cleanup: the checks in SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can now be moved in the common kvm_gmem_populate() function
2024-07-26KVM: x86: disallow pre-fault for SNP VMs before initializationPaolo Bonzini
KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26KVM: Documentation: Fix title underline too short warningChang Yu
Fix "WARNING: Title underline too short" by extending title line to the proper length. Signed-off-by: Chang Yu <marcus.yu.56@gmail.com> Message-ID: <ZqB3lofbzMQh5Q-5@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-25Merge tag 'uml-for-linus-6.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML updates from Richard Weinberger: - Support for preemption - i386 Rust support - Huge cleanup by Benjamin Berg - UBSAN support - Removal of dead code * tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (41 commits) um: vector: always reset vp->opened um: vector: remove vp->lock um: register power-off handler um: line: always fill *error_out in setup_one_line() um: remove pcap driver from documentation um: Enable preemption in UML um: refactor TLB update handling um: simplify and consolidate TLB updates um: remove force_flush_all from fork_handler um: Do not flush MM in flush_thread um: Delay flushing syscalls until the thread is restarted um: remove copy_context_skas0 um: remove LDT support um: compress memory related stub syscalls while adding them um: Rework syscall handling um: Add generic stub_syscall6 function um: Create signal stack memory assignment in stub_data um: Remove stub-data.h include from common-offsets.h um: time-travel: fix signal blocking race/hang um: time-travel: remove time_exit() ...
2024-07-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement - Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogenous hardware - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol - FPSIMD/SVE support for nested, including merged trap configuration and exception routing - New command-line parameter to control the WFx trap behavior under KVM - Introduce kCFI hardening in the EL2 hypervisor - Fixes + cleanups for handling presence/absence of FEAT_TCRX - Miscellaneous fixes + documentation updates LoongArch: - Add paravirt steal time support - Add support for KVM_DIRTY_LOG_INITIALLY_SET - Add perf kvm-stat support for loongarch RISC-V: - Redirect AMO load/store access fault traps to guest - perf kvm stat support - Use guest files for IMSIC virtualization, when available s390: - Assortment of tiny fixes which are not time critical x86: - Fixes for Xen emulation - Add a global struct to consolidate tracking of host values, e.g. EFER - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX - Print the name of the APICv/AVIC inhibits in the relevant tracepoint - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor - Drop MTRR virtualization, and instead always honor guest PAT on CPUs that support self-snoop - Update to the newfangled Intel CPU FMS infrastructure - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored - Misc cleanups x86 - MMU: - Small cleanups, renames and refactoring extracted from the upcoming Intel TDX support - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting, to avoid stalling vCPUs when splitting huge pages - Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state because the callers fully expect a valid SPTE, it's all but dangerous to let more MMU changes happen afterwards x86 - AMD: - Make per-CPU save_area allocations NUMA-aware - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code - Base support for running SEV-SNP guests. API-wise, this includes a new KVM_X86_SNP_VM type, encrypting/measure the initial image into guest memory, and finalizing it before launching it. Internally, there are some gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges This includes basic support for attestation guest requests, enough to say that KVM supports the GHCB 2.0 specification There is no support yet for loading into the firmware those signing keys to be used for attestation requests, and therefore no need yet for the host to provide certificate data for those keys. To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data x86 - Intel: - Remove an unnecessary EPT TLB flush when enabling hardware - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1) - KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed See commit 0dc902267cb3 ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows Generic: - Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE, because the special casing needed by these pages is not due to just unmovability (and in fact they are only unmovable because the CPU cannot access them) - New ioctl to populate the KVM page tables in advance, which is useful to mitigate KVM page faults during guest boot or after live migration. The code will also be used by TDX, but (probably) not through the ioctl - Enable halt poll shrinking by default, as Intel found it to be a clear win - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out() - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout Selftests: - Remove dead code in the memslot modification stress test - Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs - Print the guest pseudo-RNG seed only when it changes, to avoid spamming the log for tests that create lots of VMs - Make the PMU counters test less flaky when counting LLC cache misses by doing CLFLUSH{OPT} in every loop iteration" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits) crypto: ccp: Add the SNP_VLEK_LOAD command KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops KVM: x86: Replace static_call_cond() with static_call() KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event x86/sev: Move sev_guest.h into common SEV header KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event KVM: x86: Suppress MMIO that is triggered during task switch emulation KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault" KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory KVM: Document KVM_PRE_FAULT_MEMORY ioctl mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE perf kvm: Add kvm-stat for loongarch64 LoongArch: KVM: Add PV steal time support in guest side ...
2024-07-19Merge tag 'powerpc-6.11-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Remove support for 40x CPUs & platforms - Add support to the 64-bit BPF JIT for cpu v4 instructions - Fix PCI hotplug driver crash on powernv - Fix doorbell emulation for KVM on PAPR guests (nestedv2) - Fix KVM nested guest handling of some less used SPRs - Online NUMA nodes with no CPU/memory if they have a PCI device attached - Reduce memory overhead of enabling kfence on 64-bit Radix MMU kernels - Reimplement the iommu table_group_ops for pseries for VFIO SPAPR TCE Thanks to: Anjali K, Artem Savkov, Athira Rajeev, Breno Leitao, Brian King, Celeste Liu, Christophe Leroy, Esben Haabendal, Gaurav Batra, Gautam Menghani, Haren Myneni, Hari Bathini, Jeff Johnson, Krishna Kumar, Krzysztof Kozlowski, Nathan Lynch, Nicholas Piggin, Nick Bowler, Nilay Shroff, Rob Herring (Arm), Shawn Anastasio, Shivaprasad G Bhat, Sourabh Jain, Srikar Dronamraju, Timothy Pearson, Uwe Kleine-König, and Vaibhav Jain. * tag 'powerpc-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (57 commits) Documentation/powerpc: Mention 40x is removed powerpc: Remove 40x leftovers macintosh/therm_windtunnel: fix module unload. powerpc: Check only single values are passed to CPU/MMU feature checks powerpc/xmon: Fix disassembly CPU feature checks powerpc: Drop clang workaround for builtin constant checks powerpc64/bpf: jit support for signed division and modulo powerpc64/bpf: jit support for sign extended mov powerpc64/bpf: jit support for sign extended load powerpc64/bpf: jit support for unconditional byte swap powerpc64/bpf: jit support for 32bit offset jmp instruction powerpc/pci: Hotplug driver bridge support pci/hotplug/pnv_php: Fix hotplug driver crash on Powernv powerpc/configs: Update defconfig with now user-visible CONFIG_FSL_IFC powerpc: add missing MODULE_DESCRIPTION() macros macintosh/mac_hid: add MODULE_DESCRIPTION() KVM: PPC: add missing MODULE_DESCRIPTION() macros powerpc/kexec: Use of_property_read_reg() powerpc/64s/radix/kfence: map __kfence_pool at page granularity powerpc/pseries/iommu: Define spapr_tce_table_group_ops only with CONFIG_IOMMU_API ...
2024-07-17crypto: ccp: Add the SNP_VLEK_LOAD commandMichael Roth
When requesting an attestation report a guest is able to specify whether it wants SNP firmware to sign the report using either a Versioned Chip Endorsement Key (VCEK), which is derived from chip-unique secrets, or a Versioned Loaded Endorsement Key (VLEK) which is obtained from an AMD Key Derivation Service (KDS) and derived from seeds allocated to enrolled cloud service providers (CSPs). For VLEK keys, an SNP_VLEK_LOAD SNP firmware command is used to load them into the system after obtaining them from the KDS. Add a corresponding userspace interface so to allow the loading of VLEK keys into the system. See SEV-SNP Firmware ABI 1.54, SNP_VLEK_LOAD for more details. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501085210.2213060-21-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16Merge tag 'x86_sev_for_v6.11_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add support for running the kernel in a SEV-SNP guest, over a Secure VM Service Module (SVSM). When running over a SVSM, different services can run at different protection levels, apart from the guest OS but still within the secure SNP environment. They can provide services to the guest, like a vTPM, for example. This series adds the required facilities to interface with such a SVSM module. - The usual fixlets, refactoring and cleanups [ And as always: "SEV" is AMD's "Secure Encrypted Virtualization". I can't be the only one who gets all the newer x86 TLA's confused, can I? - Linus ] * tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/ABI/configfs-tsm: Fix an unexpected indentation silly x86/sev: Do RMP memory coverage check after max_pfn has been set x86/sev: Move SEV compilation units virt: sev-guest: Mark driver struct with __refdata to prevent section mismatch x86/sev: Allow non-VMPL0 execution when an SVSM is present x86/sev: Extend the config-fs attestation support for an SVSM x86/sev: Take advantage of configfs visibility support in TSM fs/configfs: Add a callback to determine attribute visibility sev-guest: configfs-tsm: Allow the privlevel_floor attribute to be updated virt: sev-guest: Choose the VMPCK key based on executing VMPL x86/sev: Provide guest VMPL level to userspace x86/sev: Provide SVSM discovery support x86/sev: Use the SVSM to create a vCPU when not in VMPL0 x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0 x86/sev: Use kernel provided SVSM Calling Areas x86/sev: Check for the presence of an SVSM in the SNP secrets page x86/irqflags: Provide native versions of the local_irq_save()/restore()
2024-07-16Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups
2024-07-16Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
2024-07-16Merge tag 'kvmarm-6.11' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.11 - Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement - Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogenous hardware - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol - FPSIMD/SVE support for nested, including merged trap configuration and exception routing - New command-line parameter to control the WFx trap behavior under KVM - Introduce kCFI hardening in the EL2 hypervisor - Fixes + cleanups for handling presence/absence of FEAT_TCRX - Miscellaneous fixes + documentation updates
2024-07-12Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
2024-07-12Merge tag 'kvm-s390-next-6.11-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD Assortment of tiny fixes which are not time critical: - Rejecting memory region operations for ucontrol mode VMs - Rewind the PSW on host intercepts for VSIE - Remove unneeded include
2024-07-12KVM: Document KVM_PRE_FAULT_MEMORY ioctlIsaku Yamahata
Adds documentation of KVM_PRE_FAULT_MEMORY ioctl. [1] It populates guest memory. It doesn't do extra operations on the underlying technology-specific initialization [2]. For example, CoCo-related operations won't be performed. Concretely for TDX, this API won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND(). Vendor-specific APIs are required for such operations. The key point is to adapt of vcpu ioctl instead of VM ioctl. First, populating guest memory requires vcpu. If it is VM ioctl, we need to pick one vcpu somehow. Secondly, vcpu ioctl allows each vcpu to invoke this ioctl in parallel. It helps to scale regarding guest memory size, e.g., hundreds of GB. [1] https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/ [2] https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <9a060293c9ad9a78f1d8994cfe1311e818e99257.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-04um: remove pcap driver from documentationJohannes Berg
We just removed the driver, but forgot the row in the documentation table mentioning it. Remove that. Link: https://patch.msgid.link/20240703172013.c3232dc5cbe9.I7f9941095d45a39d0f1e83680a408b3c6d445242@changeid Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-04kvm: s390: Reject memory region operations for ucontrol VMsChristoph Schlameuss
This change rejects the KVM_SET_USER_MEMORY_REGION and KVM_SET_USER_MEMORY_REGION2 ioctls when called on a ucontrol VM. This is necessary since ucontrol VMs have kvm->arch.gmap set to 0 and would thus result in a null pointer dereference further in. Memory management needs to be performed in userspace and using the ioctls KVM_S390_UCAS_MAP and KVM_S390_UCAS_UNMAP. Also improve s390 specific documentation for KVM_SET_USER_MEMORY_REGION and KVM_SET_USER_MEMORY_REGION2. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Fixes: 27e0393f15fc ("KVM: s390: ucontrol: per vcpu address spaces") Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20240624095902.29375-1-schlameuss@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> [frankja@linux.ibm.com: commit message spelling fix, subject prefix fix] Message-ID: <20240624095902.29375-1-schlameuss@linux.ibm.com>
2024-06-28KVM: Documentation: Correct the VGIC V2 CPU interface addr space sizeChangyuan Lyu
In arch/arm64/include/uapi/asm/kvm.h, we have #define KVM_VGIC_V2_CPU_SIZE 0x2000 So the CPU interface address space should cover 8 KByte not 4 KByte. Signed-off-by: Changyuan Lyu <changyuanl@google.com> Link: https://lore.kernel.org/r/20240623164542.2999626-3-changyuanl@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-06-28KVM: Documentation: Enumerate allowed value macros of `irq_type`Changyuan Lyu
The expression `irq_type[n]` may confuse readers to interpret `n` as the bit position and think of CPU = 1 << 0, SPI = 1 << 1, and PPI = 1 << 2. Since arch/arm64/include/uapi/asm/kvm.h already has macro definitions for the allowed values, this commit uses these symbols to clear up the ambiguity. Signed-off-by: Changyuan Lyu <changyuanl@google.com> Link: https://lore.kernel.org/r/20240623164542.2999626-2-changyuanl@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-06-28KVM: Documentation: Fix typo `BFD`Changyuan Lyu
BDF is the acronym for Bus, Device, Function. Signed-off-by: Changyuan Lyu <changyuanl@google.com> Link: https://lore.kernel.org/r/20240623164542.2999626-1-changyuanl@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-06-24Documentation: hyperv: Add overview of Confidential Computing VM supportMichael Kelley
Add documentation topic for Confidential Computing (CoCo) VM support in Linux guests on Hyper-V. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://lore.kernel.org/r/20240618165059.10174-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240618165059.10174-1-mhklinux@outlook.com>
2024-06-17virt: sev-guest: Choose the VMPCK key based on executing VMPLTom Lendacky
Currently, the sev-guest driver uses the vmpck-0 key by default. When an SVSM is present, the kernel is running at a VMPL other than 0 and the vmpck-0 key is no longer available. If a specific vmpck key has not be requested by the user via the vmpck_id module parameter, choose the vmpck key based on the active VMPL level. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/b88081c5d88263176849df8ea93e90a404619cab.1717600736.git.thomas.lendacky@amd.com
2024-06-11KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flagThomas Prescher
When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI). This causes multiple problems: The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion. Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape. Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace. [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55 Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Improve documentation for KVM_CAP_X86_BUS_LOCK_EXITCarlos López
Improve the description for the KVM_CAP_X86_BUS_LOCK_EXIT capability to fix a few typos and grammar issues, and to clarify the purpose of the capability. Signed-off-by: Carlos López <clopez@suse.de> Link: https://lore.kernel.org/r/20240424105616.29596-1-clopez@suse.de [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-06KVM: PPC: Book3S HV: Add one-reg interface for HASHPKEYR registerShivaprasad G Bhat
The patch adds a one-reg register identifier which can be used to read and set the virtual HASHPKEYR for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR. The specific SPR KVM API documentation too updated. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171759285547.1480.12374595786792346073.stgit@linux.ibm.com
2024-06-06KVM: PPC: Book3S HV: Add one-reg interface for HASHKEYR registerShivaprasad G Bhat
The patch adds a one-reg register identifier which can be used to read and set the virtual HASHKEYR for the guest during enter/exit with KVM_REG_PPC_HASHKEYR. The specific SPR KVM API documentation too updated. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171759283170.1480.12904332463112769129.stgit@linux.ibm.com
2024-06-06KVM: PPC: Book3S HV: Add one-reg interface for DEXCR registerShivaprasad G Bhat
The patch adds a one-reg register identifier which can be used to read and set the DEXCR for the guest during enter/exit with KVM_REG_PPC_DEXCR. The specific SPR KVM API documentation too updated. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171759279613.1480.12873911783530175699.stgit@linux.ibm.com
2024-06-05KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1Sean Christopherson
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does so with non-coherent DMA. Lastly, commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was from the same author as commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization"), and followed by a mere month. I.e. forcing UC memory was likely the result of code inspection or perhaps misdiagnosed failures, and not the necessitate by a concrete use case. Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now AMD-only, and to take an erratum for lack of CR0.CD virtualization on Intel. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Remove VMX support for virtualizing guest MTRR memtypesSean Christopherson
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM *only* honors guest MTRR memtypes if EPT is enabled *and* the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit efdfe536d8c643391e19d5726b072f82964bfbdb Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for: commit 0bed3b568b68e5835ef5da888a372b9beabf7544 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. which was part of a "MTRR/PAT support for EPT" series that also added: + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte |= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given memtype. And *very* subtly, at the time of that change, KVM *always* set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT | VMX_EPT_IGMT_BIT); which came in via: commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole "ignoreIGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix. Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess. commit 2aaf69dcee864f4fb6402638dd2f263324ac839f Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. * Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit 606decd67049217684e3cb5a54104d51ddd4ef35 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. It was reported to cause Machine Check Exceptions (bug 104091). ... commit fd717f11015f673487ffc826e59b2bad69d20fe5 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, it doesn't appear that anyone ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit fc07e76ac7ffa3afd621a1c3858a503386a14281 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host MMIO. And while there *are* known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring *host* userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Add a capability to configure bus frequency for APIC timerIsaku Yamahata
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC bus clock frequency for APIC timer emulation. Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the frequency in nanoseconds. When using this capability, the user space VMM should configure CPUID leaf 0x15 to advertise the frequency. Vishal reported that the TDX guest kernel expects a 25MHz APIC bus frequency but ends up getting interrupts at a significantly higher rate. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is enumerated, the guest kernel uses it as the APIC bus frequency. If not, the guest kernel measures the frequency based on other known timers like the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest kernel expects a 25MHz timer frequency but gets timer interrupt more frequently due to the 1GHz frequency used by KVM. To ensure that the guest doesn't have a conflicting view of the APIC bus frequency, allow the userspace to tell KVM to use the same frequency that TDX mandates instead of the default 1Ghz. Reported-by: Vishal Annapurve <vannapurve@google.com> Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03Merge branch 'kvm-6.11-sev-snp' into HEADPaolo Bonzini
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests.
2024-06-03KVM: fix documentation rendering for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROMJulian Stecklina
The documentation for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM doesn't use the correct keyword formatting, which breaks rendering on https://www.kernel.org/doc/html/latest/virt/kvm/api.html. Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20240520143220.340737-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: Update halt polling documentation to note that KVM has 4 module paramsParshuram Sangle
Update KVM's halt-polling documentation to correclty reflect that KVM has 4 relevant module params, not 3 params. Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com> Link: https://lore.kernel.org/r/20231102154628.2120-3-parshuram.sangle@intel.com [sean: drop unrelated and misleading doc update] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: Enable halt polling shrink parameter by defaultParshuram Sangle
Default halt_poll_ns_shrink value of 0 always resets polling interval to 0 on an un-successful poll where vcpu wakeup is not received. This is mostly to avoid pointless polling for more number of shorter intervals. But disabled shrink assumes vcpu wakeup is less likely to be received in subsequent shorter polling intervals. Another side effect of 0 shrink value is that, even on a successful poll if total block time was greater than current polling interval, the polling interval starts over from 0 instead of shrinking by a factor. Enabling shrink with value of 2 allows the polling interval to gradually decrement in case of un-successful poll events as well. This gives a fair chance for successful polling events in subsequent polling intervals rather than resetting it to 0 and starting over from grow_start. Below kvm stat log snippet shows interleaved growth and shrinking of polling interval: 87162647182125: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0) 87162647637763: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162649627943: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162650892407: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162651540378: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162652276768: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162652515037: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162653383787: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162653627670: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162653796321: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162656171645: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162661607487: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000) Having both grow and shrink enabled creates a balance in polling interval growth and shrink behavior. Tests show improved successful polling attempt ratio which contribute to VM performance. Power penalty is quite negligible as shrunk polling intervals create bursts of very short durations. Performance assessment results show 3-6% improvements in CPU+GPU, Memory and Storage Android VM workloads whereas 5-9% improvement in average FPS of gaming VM workloads. Power penalty is below 1% where host OS is either idle or running a native workload having 2 VMs enabled. CPU/GPU intensive gaming workloads as well do not show any increased power overhead with shrink enabled. Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com> Link: https://lore.kernel.org/r/20231102154628.2120-2-parshuram.sangle@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-28Documentation: hyperv: Improve synic and interrupt handling descriptionMichael Kelley
Current documentation does not describe how Linux handles the synthetic interrupt controller (synic) that Hyper-V provides to guest VMs, nor how VMBus or timer interrupts are handled. Add text describing the synic and reorganize existing text to make this more clear. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20240511133818.19649-2-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240511133818.19649-2-mhklinux@outlook.com>
2024-05-28Documentation: hyperv: Update spelling and fix typoMichael Kelley
Update spelling from "VMbus" to "VMBus" to match Hyper-V product documentation. Also correct typo: "SNP-SEV" should be "SEV-SNP". Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20240511133818.19649-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240511133818.19649-1-mhklinux@outlook.com>
2024-05-17Merge tag 'powerpc-6.10-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Enable BPF Kernel Functions (kfuncs) in the powerpc BPF JIT. - Allow per-process DEXCR (Dynamic Execution Control Register) settings via prctl, notably NPHIE which controls hashst/hashchk for ROP protection. - Install powerpc selftests in sub-directories. Note this changes the way run_kselftest.sh needs to be invoked for powerpc selftests. - Change fadump (Firmware Assisted Dump) to better handle memory add/remove. - Add support for passing additional parameters to the fadump kernel. - Add support for updating the kdump image on CPU/memory add/remove events. - Other small features, cleanups and fixes. Thanks to Andrew Donnellan, Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann, Benjamin Gray, Bjorn Helgaas, Christian Zigotzky, Christophe Jaillet, Christophe Leroy, Colin Ian King, Cédric Le Goater, Dr. David Alan Gilbert, Erhard Furtner, Frank Li, GUO Zihua, Ganesh Goudar, Geoff Levand, Ghanshyam Agrawal, Greg Kurz, Hari Bathini, Joel Stanley, Justin Stitt, Kunwu Chan, Li Yang, Lidong Zhong, Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Matthias Schiffer, Naresh Kamboju, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Miehlbradt, Ran Wang, Randy Dunlap, Ritesh Harjani, Sachin Sant, Shirisha Ganta, Shrikanth Hegde, Sourabh Jain, Stephen Rothwell, sundar, Thorsten Blum, Vaibhav Jain, Xiaowei Bao, Yang Li, and Zhao Chenhui. * tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (85 commits) powerpc/fadump: Fix section mismatch warning powerpc/85xx: fix compile error without CONFIG_CRASH_DUMP powerpc/fadump: update documentation about bootargs_append powerpc/fadump: pass additional parameters when fadump is active powerpc/fadump: setup additional parameters for dump capture kernel powerpc/pseries/fadump: add support for multiple boot memory regions selftests/powerpc/dexcr: Fix spelling mistake "predicition" -> "prediction" KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info() KVM: PPC: Fix documentation for ppc mmu caps KVM: PPC: code cleanup for kvmppc_book3s_irqprio_deliver KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#" powerpc/code-patching: Use dedicated memory routines for patching powerpc/code-patching: Test patch_instructions() during boot powerpc64/kasan: Pass virtual addresses to kasan_init_phys_region() powerpc: rename SPRN_HID2 define to SPRN_HID2_750FX powerpc: Fix typos powerpc/eeh: Fix spelling of the word "auxillary" and update comment macintosh/ams: Fix unused variable warning powerpc/Makefile: Remove bits related to the previous use of -mcmodel=large ...
2024-05-12KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH commandBrijesh Singh
Add a KVM_SEV_SNP_LAUNCH_FINISH command to finalize the cryptographic launch digest which stores the measurement of the guest at launch time. Also extend the existing SNP firmware data structures to support disabling the use of Versioned Chip Endorsement Keys (VCEK) by guests as part of this command. While finalizing the launch flow, the code also issues the LAUNCH_UPDATE SNP firmware commands to encrypt/measure the initial VMSA pages for each configured vCPU, which requires setting the RMP entries for those pages to private, so also add handling to clean up the RMP entries for these pages whening freeing vCPUs during shutdown. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Harald Hoyer <harald@profian.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-8-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>