summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-14KVM: TDX: Get TDX global informationKai Huang
KVM will need to consult some essential TDX global information to create and run TDX guests. Get the global information after initializing TDX. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20241030190039.77971-3-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: VMX: Initialize TDX during KVM module loadKai Huang
Before KVM can use TDX to create and run TDX guests, TDX needs to be initialized from two perspectives: 1) TDX module must be initialized properly to a working state; 2) A per-cpu TDX initialization, a.k.a the TDH.SYS.LP.INIT SEAMCALL must be done on any logical cpu before it can run any other TDX SEAMCALLs. The TDX host core-kernel provides two functions to do the above two respectively: tdx_enable() and tdx_cpu_enable(). There are two options in terms of when to initialize TDX: initialize TDX at KVM module loading time, or when creating the first TDX guest. Choose to initialize TDX during KVM module loading time: Initializing TDX module is both memory and CPU time consuming: 1) the kernel needs to allocate a non-trivial size(~1/256) of system memory as metadata used by TDX module to track each TDX-usable memory page's status; 2) the TDX module needs to initialize this metadata, one entry for each TDX-usable memory page. Also, the kernel uses alloc_contig_pages() to allocate those metadata chunks, because they are large and need to be physically contiguous. alloc_contig_pages() can fail. If initializing TDX when creating the first TDX guest, then there's chance that KVM won't be able to run any TDX guests albeit KVM _declares_ to be able to support TDX. This isn't good for the user. On the other hand, initializing TDX at KVM module loading time can make sure KVM is providing a consistent view of whether KVM can support TDX to the user. Always only try to initialize TDX after VMX has been initialized. TDX is based on VMX, and if VMX fails to initialize then TDX is likely to be broken anyway. Also, in practice, supporting TDX will require part of VMX and common x86 infrastructure in working order, so TDX cannot be enabled alone w/o VMX support. There are two cases that can result in failure to initialize TDX: 1) TDX cannot be supported (e.g., because of TDX is not supported or enabled by hardware, or module is not loaded, or missing some dependency in KVM's configuration); 2) Any unexpected error during TDX bring-up. For the first case only mark TDX is disabled but still allow KVM module to be loaded. For the second case just fail to load the KVM module so that the user can be aware. Because TDX costs additional memory, don't enable TDX by default. Add a new module parameter 'enable_tdx' to allow the user to opt-in. Note, the name tdx_init() has already been taken by the early boot code. Use tdx_bringup() for initializing TDX (and tdx_cleanup() since KVM doesn't actually teardown TDX). They don't match vt_init()/vt_exit(), vmx_init()/vmx_exit() etc but it's not end of the world. Also, once initialized, the TDX module cannot be disabled and enabled again w/o the TDX module runtime update, which isn't supported by the kernel. After TDX is enabled, nothing needs to be done when KVM disables hardware virtualization, e.g., when offlining CPU, or during suspend/resume. TDX host core-kernel code internally tracks TDX status and can handle "multiple enabling" scenario. Similar to KVM_AMD_SEV, add a new KVM_INTEL_TDX Kconfig to guide KVM TDX code. Make it depend on INTEL_TDX_HOST but not replace INTEL_TDX_HOST because in the longer term there's a use case that requires making SEAMCALLs w/o KVM as mentioned by Dan [1]. Link: https://lore.kernel.org/6723fc2070a96_60c3294dc@dwillia2-mobl3.amr.corp.intel.com.notmuch/ [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <162f9dee05c729203b9ad6688db1ca2960b4b502.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: VMX: Refactor VMX module init/exit functionsKai Huang
Add vt_init() and vt_exit() as the new module init/exit functions and refactor existing vmx_init()/vmx_exit() as helper to make room for TDX specific initialization and teardown. To support TDX, KVM will need to enable TDX during KVM module loading time. Enabling TDX requires enabling hardware virtualization first so that all online CPUs (and the new CPU going online) are in post-VMXON state. Currently, the vmx_init() flow is: 1) hv_init_evmcs(), 2) kvm_x86_vendor_init(), 3) Other VMX specific initialization, 4) kvm_init() The kvm_x86_vendor_init() invokes kvm_x86_init_ops::hardware_setup() to do VMX specific hardware setup and calls kvm_update_ops() to initialize kvm_x86_ops to VMX's version. TDX will have its own version for most of kvm_x86_ops callbacks. It would be nice if kvm_x86_init_ops::hardware_setup() could also be used for TDX, but in practice it cannot. The reason is, as mentioned above, TDX initialization requires hardware virtualization having been enabled, which must happen after kvm_update_ops(), but hardware_setup() is done before that. Also, TDX is based on VMX, and it makes sense to only initialize TDX after VMX has been initialized. If VMX fails to initialize, TDX is likely broken anyway. So the new flow of KVM module init function will be: 1) Current VMX initialization code in vmx_init() before kvm_init(), 2) TDX initialization, 3) kvm_init() Split vmx_init() into two parts based on above 1) and 3) so that TDX initialization can fit in between. Make part 1) as the new helper vmx_init(). Introduce vt_init() as the new module init function which calls vmx_init() and kvm_init(). TDX initialization will be added later. Do the same thing for vmx_exit()/vt_exit(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <3f23f24098bdcf42e213798893ffff7cdc7103be.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: Export hardware virtualization enabling/disabling functionsKai Huang
To support TDX, KVM will need to enable TDX during KVM module loading time. Enabling TDX requires enabling hardware virtualization first so that all online CPUs (and the new CPU going online) are in post-VMXON state. KVM by default enables hardware virtualization but that is done in kvm_init(), which must be the last step after all initialization is done thus is too late for enabling TDX. Export functions to enable/disable hardware virtualization so that TDX code can use them to handle hardware virtualization enabling before kvm_init(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <dfe17314c0d9978b7bc3b0833dff6f167fbd28f5.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add tdx_guest_keyid_alloc/free() to alloc and free TDX guest KeyIDIsaku Yamahata
Intel TDX protects guest VMs from malicious host and certain physical attacks. Pre-TDX Intel hardware has support for a memory encryption architecture called MK-TME, which repurposes several high bits of physical address as "KeyID". The BIOS reserves a sub-range of MK-TME KeyIDs as "TDX private KeyIDs". Each TDX guest must be assigned with a unique TDX KeyID when it is created. The kernel reserves the first TDX private KeyID for crypto-protection of specific TDX module data which has a lifecycle that exceeds the KeyID reserved for the TD's use. The rest of the KeyIDs are left for TDX guests to use. Create a small KeyID allocator. Export tdx_guest_keyid_alloc()/tdx_guest_keyid_free() to allocate and free TDX guest KeyID for KVM to use. Don't provide the stub functions when CONFIG_INTEL_TDX_HOST=n since they are not supposed to be called in this case. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20241030190039.77971-5-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Read essential global metadata for KVMKai Huang
KVM needs two classes of global metadata to create and run TDX guests: - "TD Control Structures" - "TD Configurability" The first class contains the sizes of TDX guest per-VM and per-vCPU control structures. KVM will need to use them to allocate enough space for those control structures. The second class contains info which reports things like which features are configurable to TDX guest etc. KVM will need to use them to properly configure TDX guests. Read them for KVM TDX to use. The code change is auto-generated by re-running the script in [1] after uncommenting the "td_conf" and "td_ctrl" part to regenerate the tdx_global_metadata.{hc} and update them to the existing ones in the kernel. #python tdx.py global_metadata.json tdx_global_metadata.h \ tdx_global_metadata.c The 'global_metadata.json' can be fetched from [2]. Note that as of this writing, the JSON file only allows a maximum of 32 CPUID entries. While this is enough for current contents of the CPUID leaves, there were plans to change the JSON per TDX module release which would change the ABI and potentially prevent future versions of the TDX module from working with older kernels. While discussions are ongoing with the TDX module team on what exactly constitutes an ABI breakage, in the meantime the TDX module team has agreed to not increase the number of CPUID entries beyond 128 without an opt in. Therefore the file was tweaked by hand to change the maximum number of CPUID_CONFIGs. Link: https://lore.kernel.org/kvm/0853b155ec9aac09c594caa60914ed6ea4dc0a71.camel@intel.com/ [1] Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20241030190039.77971-4-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: allocate tdx_sys_info in static memoryPaolo Bonzini
Adding all the information that KVM needs increases the size of struct tdx_sys_info, to the point that you can get warnings about the stack size of init_tdx_module(). Since KVM also needs to read the TDX metadata after init_tdx_module() returns, make the variable a global. Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add SEAMCALL wrappers for TDX flush operationsRick Edgecombe
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module has the concept of flushing vCPUs. These flushes include both a flush of the translation caches and also any other state internal to the TDX module. Before freeing a KeyID, this flush operation needs to be done. KVM will need to perform the flush on each pCPU associated with the TD, and also perform a TD scoped operation that checks if the flush has been done on all vCPU's associated with the TD. Add a tdh_vp_flush() function to be used to call TDH.VP.FLUSH on each pCPU associated with the TD during TD teardown. It will also be called when disabling TDX and during vCPU migration between pCPUs. Add tdh_mng_vpflushdone() to be used by KVM to call TDH.MNG.VPFLUSHDONE. KVM will use this during TD teardown to verify that TDH.VP.FLUSH has been called sufficiently, and advance the state machine that will allow for reclaiming the TD's KeyID. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-7-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add SEAMCALL wrappers for TDX VM/vCPU field accessRick Edgecombe
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module has TD scoped and vCPU scoped "metadata fields". These fields are a bit like VMCS fields, and stored in data structures maintained by the TDX module. Export 3 SEAMCALLs for use in reading and writing these fields: Make tdh_mng_rd() use MNG.VP.RD to read the TD scoped metadata. Make tdh_vp_rd()/tdh_vp_wr() use TDH.VP.RD/WR to read/write the vCPU scoped metadata. KVM will use these by creating inline helpers that target various metadata sizes. Export the raw SEAMCALL leaf, to avoid exporting the large number of various sized helpers. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Message-ID: <20241203010317.827803-6-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add SEAMCALL wrappers for TDX page cache managementRick Edgecombe
Intel TDX protects guest VMs from malicious host and certain physical attacks. The TDX module uses pages provided by the host for both control structures and for TD guest pages. These pages are encrypted using the MK-TME encryption engine, with its special requirements around cache invalidation. For its own security, the TDX module ensures pages are flushed properly and track which usage they are currently assigned. For creating and tearing down TD VMs and vCPUs KVM will need to use the TDH.PHYMEM.PAGE.RECLAIM, TDH.PHYMEM.CACHE.WB, and TDH.PHYMEM.PAGE.WBINVD SEAMCALLs. Add tdh_phymem_page_reclaim() to enable KVM to call TDH.PHYMEM.PAGE.RECLAIM to reclaim the page for use by the host kernel. This effectively resets its state in the TDX module's page tracking (PAMT), if the page is available to be reclaimed. This will be used by KVM to reclaim the various types of pages owned by the TDX module. It will have a small wrapper in KVM that retries in the case of a relevant error code. Don't implement this wrapper in arch/x86 because KVM's solution around retrying SEAMCALLs will be better located in a single place. Add tdh_phymem_cache_wb() to enable KVM to call TDH.PHYMEM.CACHE.WB to do a cache write back in a way that the TDX module can verify, before it allows a KeyID to be freed. The KVM code will use this to have a small wrapper that handles retries. Since the TDH.PHYMEM.CACHE.WB operation is interruptible, have tdh_phymem_cache_wb() take a resume argument to pass this info to the TDX module for restarts. It is worth noting that this SEAMCALL uses a SEAM specific MSR to do the write back in sections. In this way it does export some new functionality that affects CPU state. Add tdh_phymem_page_wbinvd_tdr() to enable KVM to call TDH.PHYMEM.PAGE.WBINVD to do a cache write back and invalidate of a TDR, using the global KeyID. The underlying TDH.PHYMEM.PAGE.WBINVD SEAMCALL requires the related KeyID to be encoded into the SEAMCALL args. Since the global KeyID is not exposed to KVM, a dedicated wrapper is needed for TDR focused TDH.PHYMEM.PAGE.WBINVD operations. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-5-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add SEAMCALL wrappers for TDX vCPU creationRick Edgecombe
Intel TDX protects guest VMs from malicious host and certain physical attacks. It defines various control structures that hold state for virtualized components of the TD (i.e. VMs or vCPUs) These control structures are stored in pages given to the TDX module and encrypted with either the global KeyID or the guest KeyIDs. To manipulate these control structures the TDX module defines a few SEAMCALLs. KVM will use these during the process of creating a vCPU as follows: 1) Call TDH.VP.CREATE to create a TD vCPU Root (TDVPR) page for each vCPU. 2) Call TDH.VP.ADDCX to add per-vCPU control pages (TDCX) for each vCPU. 3) Call TDH.VP.INIT to initialize the TDCX for each vCPU. To reclaim these pages for use by the kernel other SEAMCALLs are needed, which will be added in future patches. Export functions to allow KVM to make these SEAMCALLs. Export two variants for TDH.VP.CREATE, in order to support the planned logic of KVM to support TDX modules with and without the ENUM_TOPOLOGY feature. If KVM can drop support for the !ENUM_TOPOLOGY case, this could go down a single version. Leave that for later discussion. The TDX module provides SEAMCALLs to hand pages to the TDX module for storing TDX controlled state. SEAMCALLs that operate on this state are directed to the appropriate TD vCPU using references to the pages originally provided for managing the vCPU's state. So the host kernel needs to track these pages, both as an ID for specifying which vCPU to operate on, and to allow them to be eventually reclaimed. The vCPU associated pages are called TDVPR (Trust Domain Virtual Processor Root) and TDCX (Trust Domain Control Extension). Introduce "struct tdx_vp" for holding references to pages provided to the TDX module for the TD vCPU associated state. Don't plan for any vCPU associated state that is controlled by KVM to live in this struct. Only expect it to hold data for concepts specific to the TDX architecture, for which there can't already be preexisting storage for in KVM. Add both the TDVPR page and an array of TDCX pages, even though the SEAMCALL wrappers will only need to know about the TDVPR pages for directing the SEAMCALLs to the right vCPU. Adding the TDCX pages to this struct will let all of the vCPU associated pages handed to the TDX module be tracked in one location. For a type to specify physical pages, use KVM's hpa_t type. Do this for KVM's benefit This is the common type used to hold physical addresses in KVM, so will make interoperability easier. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-4-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add SEAMCALL wrappers for TDX TD creationRick Edgecombe
Intel TDX protects guest VMs from malicious hosts and certain physical attacks. It defines various control structures that hold state for things like TDs or vCPUs. These control structures are stored in pages given to the TDX module and encrypted with either the global KeyID or the guest KeyIDs. To manipulate these control structures the TDX module defines a few SEAMCALLs. KVM will use these during the process of creating a TD as follows: 1) Allocate a unique TDX KeyID for a new guest. 1) Call TDH.MNG.CREATE to create a "TD Root" (TDR) page, together with the new allocated KeyID. Unlike the rest of the TDX guest, the TDR page is crypto-protected by the 'global KeyID'. 2) Call the previously added TDH.MNG.KEY.CONFIG on each package to configure the KeyID for the guest. After this step, the KeyID to protect the guest is ready and the rest of the guest will be protected by this KeyID. 3) Call TDH.MNG.ADDCX to add TD Control Structure (TDCS) pages. 4) Call TDH.MNG.INIT to initialize the TDCS. To reclaim these pages for use by the kernel other SEAMCALLs are needed, which will be added in future patches. Add tdh_mng_addcx(), tdh_mng_create() and tdh_mng_init() to export these SEAMCALLs so that KVM can use them to create TDs. For SEAMCALLs that give a page to the TDX module to be encrypted, CLFLUSH the page mapped with KeyID 0, such that any dirty cache lines don't write back later and clobber TD memory or control structures. Don't worry about the other MK-TME KeyIDs because the kernel doesn't use them. The TDX docs specify that this flush is not needed unless the TDX module exposes the CLFLUSH_BEFORE_ALLOC feature bit. Be conservative and always flush. Add a helper function to facilitate this. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-3-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14x86/virt/tdx: Add SEAMCALL wrappers for TDX KeyID managementRick Edgecombe
Intel TDX protects guest VMs from malicious host and certain physical attacks. Pre-TDX Intel hardware has support for a memory encryption architecture called MK-TME, which repurposes several high bits of physical address as "KeyID". TDX ends up with reserving a sub-range of MK-TME KeyIDs as "TDX private KeyIDs". Like MK-TME, these KeyIDs can be associated with an ephemeral key. For TDX this association is done by the TDX module. It also has its own tracking for which KeyIDs are in use. To do this ephemeral key setup and manipulate the TDX module's internal tracking, KVM will use the following SEAMCALLs: TDH.MNG.KEY.CONFIG: Mark the KeyID as in use, and initialize its ephemeral key. TDH.MNG.KEY.FREEID: Mark the KeyID as not in use. These SEAMCALLs both operate on TDR structures, which are setup using the previously added TDH.MNG.CREATE SEAMCALL. KVM's use of these operations will go like: - tdx_guest_keyid_alloc() - Initialize TD and TDR page with TDH.MNG.CREATE (not yet-added), passing KeyID - TDH.MNG.KEY.CONFIG to initialize the key - TD runs, teardown is started - TDH.MNG.KEY.FREEID - tdx_guest_keyid_free() Don't try to combine the tdx_guest_keyid_alloc() and TDH.MNG.KEY.CONFIG operations because TDH.MNG.CREATE and some locking need to be done in the middle. Don't combine TDH.MNG.KEY.FREEID and tdx_guest_keyid_free() so they are symmetrical with the creation path. So implement tdh_mng_key_config() and tdh_mng_key_freeid() as separate functions than tdx_guest_keyid_alloc() and tdx_guest_keyid_free(). The TDX module provides SEAMCALLs to hand pages to the TDX module for storing TDX controlled state. SEAMCALLs that operate on this state are directed to the appropriate TD VM using references to the pages originally provided for managing the TD's state. So the host kernel needs to track these pages, both as an ID for specifying which TD to operate on, and to allow them to be eventually reclaimed. The TD VM associated pages are called TDR (Trust Domain Root) and TDCS (Trust Domain Control Structure). Introduce "struct tdx_td" for holding references to pages provided to the TDX module for this TD VM associated state. Don't plan for any TD associated state that is controlled by KVM to live in this struct. Only expect it to hold data for concepts specific to the TDX architecture, for which there can't already be preexisting storage for in KVM. Add both the TDR page and an array of TDCS pages, even though the SEAMCALL wrappers will only need to know about the TDR pages for directing the SEAMCALLs to the right TD. Adding the TDCS pages to this struct will let all of the TD VM associated pages handed to the TDX module be tracked in one location. For a type to specify physical pages, use KVM's hpa_t type. Do this for KVM's benefit This is the common type used to hold physical addresses in KVM, so will make interoperability easier. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-ID: <20241203010317.827803-2-rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protectedPaolo Bonzini
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest state, since the register values cannot actually be written. Return 0 when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero. However, on exit from KVM_RUN userspace could have placed a nonzero value into kvm_run->kvm_valid_regs, so check guest_state_protected again and skip store_regs() in that case. Cc: stable@vger.kernel.org Fixes: 517987e3fb19 ("KVM: x86: add fields to struct kvm_arch for CoCo features") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250306202923.646075-1-pbonzini@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: Add infrastructure for secure TSCIsaku Yamahata
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit changing TSC offset/multiplier when guest_tsc_protected is true. X86 confidential computing technology defines protected guest TSC so that the VMM can't change the TSC offset/multiplier once vCPU is initialized. SEV-SNP defines Secure TSC as optional, whereas TDX mandates it. KVM has common logic on x86 that tries to guess or adjust TSC offset/multiplier for better guest TSC and TSC interrupt latency at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes (kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR (kvm_set_msr_common()). The current x86 KVM implementation conflicts with protected TSC because the VMM can't change the TSC offset/multiplier. Because KVM emulates the TSC timer or the TSC deadline timer with the TSC offset/multiplier, the TSC timer interrupts is injected to the guest at the wrong time if the KVM TSC offset is different from what the TDX module determined. Originally this issue was found by cyclic test of rt-test [1] as the latency in TDX case is worse than VMX value + TDX SEAMCALL overhead. It turned out that the KVM TSC offset is different from what the TDX module determines. Disable or ignore the KVM logic to change/adjust the TSC offset/multiplier somehow, thus keeping the KVM TSC offset/multiplier the same as the value of the TDX module. Writes to MSR_IA32_TSC are also blocked as they amount to a change in the TSC offset. [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: Push down setting vcpu.arch.user_set_tscIsaku Yamahata
Push down setting vcpu.arch.user_set_tsc to true from kvm_synchronize_tsc() to __kvm_synchronize_tsc() so that the two callers don't have to modify user_set_tsc directly as preparation. Later, prohibit changing TSC synchronization for TDX guests to modify __kvm_synchornize_tsc() change. We don't want to touch caller sites not to change user_set_tsc. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: move vm_destroy callback at end of kvm_arch_destroy_vmPaolo Bonzini
TDX needs to free the TDR control structures last, after all paging structures have been torn down; move the vm_destroy callback at a suitable place. The new place is also okay for AMD; the main difference is that the MMU has been torn down and, if anything, that is better done before the SNP ASID is released. Extracted from a patch by Yan Zhao. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-09Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux ↵Paolo Bonzini
into HEAD KVM x86 fixes for 6.14-rcN #2 - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow. - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with the host's value, which can lead to unexpected bus lock #DBs. - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly emulate BTF. KVM's lack of context switching has meant BTF has always been broken to some extent. - Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest can enable DebugSwap without KVM's knowledge. - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO memory" phase without actually generating a write-protection fault. - Fix a printf() goof in the SEV smoke test that causes build failures with -Werror. - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2 isn't supported by KVM.
2025-03-09Merge tag 'kvmarm-fixes-6.14-4' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #4 - Fix a couple of bugs affecting pKVM's PSCI relay implementation when running in the hVHE mode, resulting in the host being entered with the MMU in an unknown state, and EL2 being in the wrong mode.
2025-03-04KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVMXiaoyao Li
Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX instead of zeroing both when PERFMON_V2 isn't supported by KVM. In practice, barring a buggy CPU (or vCPU model when running nested) only the !enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2 is unsupported. For the !enable_pmu case, the bug is relatively benign as KVM will refuse to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID into the guest could inadvertently induce #GPs in the guest due to advertising support for MSRs that KVM refuses to emulate. Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com [sean: massage shortlog and changelog, tag for stable] Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: selftests: Fix printf() format goof in SEV smoke testSean Christopherson
Print out the index of mismatching XSAVE bytes using unsigned decimal format. Some versions of clang complain about trying to print an integer as an unsigned char. x86/sev_smoke_test.c:55:51: error: format specifies type 'unsigned char' but the argument has type 'int' [-Werror,-Wformat] Fixes: 8c53183dbaa2 ("selftests: kvm: add test for transferring FPU state into VMSA") Link: https://lore.kernel.org/r/20250228233852.3855676-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: selftests: Ensure all vCPUs hit -EFAULT during initial RO stageSean Christopherson
During the initial mprotect(RO) stage of mmu_stress_test, keep vCPUs spinning until all vCPUs have hit -EFAULT, i.e. until all vCPUs have tried to write to a read-only page. If a vCPU manages to complete an entire iteration of the loop without hitting a read-only page, *and* the vCPU observes mprotect_ro_done before starting a second iteration, then the vCPU will prematurely fall through to GUEST_SYNC(3) (on x86 and arm64) and get out of sequence. Replace the "do-while (!r)" loop around the associated _vcpu_run() with a single invocation, as barring a KVM bug, the vCPU is guaranteed to hit -EFAULT, and retrying on success is super confusion, hides KVM bugs, and complicates this fix. The do-while loop was semi-unintentionally added specifically to fudge around a KVM x86 bug, and said bug is unhittable without modifying the test to force x86 down the !(x86||arm64) path. On x86, if forced emulation is enabled, vcpu_arch_put_guest() may trigger emulation of the store to memory. Due a (very, very) longstanding bug in KVM x86's emulator, emulate writes to guest memory that fail during __kvm_write_guest_page() unconditionally return KVM_EXIT_MMIO. While that is desirable in the !memslot case, it's wrong in this case as the failure happens due to __copy_to_user() hitting a read-only page, not an emulated MMIO region. But as above, x86 only uses vcpu_arch_put_guest() if the __x86_64__ guards are clobbered to force x86 down the common path, and of course the unexpected MMIO is a KVM bug, i.e. *should* cause a test failure. Fixes: b6c304aec648 ("KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)") Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/20250208105318.16861-1-yan.y.zhao@intel.com Debugged-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250228230804.3845860-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3Sean Christopherson
Never rely on the CPU to restore/load host DR0..DR3 values, even if the CPU supports DebugSwap, as there are no guarantees that SNP guests will actually enable DebugSwap on APs. E.g. if KVM were to rely on the CPU to load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would run with clobbered-to-zero DRs if an SNP guest created APs without DebugSwap enabled. Update the comment to explain the dangers, and hopefully prevent breaking KVM in the future. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Save host DR masks on CPUs with DebugSwapSean Christopherson
When running SEV-SNP guests on a CPU that supports DebugSwap, always save the host's DR0..DR3 mask MSR values irrespective of whether or not DebugSwap is enabled, to ensure the host values aren't clobbered by the CPU. And for now, also save DR0..DR3, even though doing so isn't necessary (see below). SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to create a VMSA with guest-controlled SEV_FEATURES. A well behaved guest can inform the hypervisor, i.e. KVM, of its "requested" features, but on CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from lying about which SEV features are being enabled (or not!). If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the MSRs with '0' if KVM doesn't save its desired value. Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM restores all DRs in common x86 code as needed via hw_breakpoint_restore(). I.e. there is no risk of host DR0..DR3 being clobbered (when it matters). However, there is a flaw in the opposite direction; because the guest can lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's knowledge, KVM must not rely on the CPU to restore DRs. Defer fixing that wart, as it's more of a documentation issue than a bug in the code. Note, KVM added support for DebugSwap on commit d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes, as the underlying flaw exists in hardware, not in KVM. I.e. all kernels that support SEV-SNP need to be patched, not just kernels with KVM's full support for DebugSwap (ignoring that DebugSwap support landed first). Opportunistically fix an incorrect statement in the comment; on CPUs without DebugSwap, the CPU does NOT save or load debug registers, i.e. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Cc: Naveen N Rao <naveen@kernel.org> Cc: Kim Phillips <kim.phillips@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-02KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu()Ahmed Genidi
When KVM is in protected mode, host calls to PSCI are proxied via EL2, and cold entries from CPU_ON, CPU_SUSPEND, and SYSTEM_SUSPEND bounce through __kvm_hyp_init_cpu() at EL2 before entering the host kernel's entry point at EL1. While __kvm_hyp_init_cpu() initializes SPSR_EL2 for the exception return to EL1, it does not initialize SCTLR_EL1. Due to this, it's possible to enter EL1 with SCTLR_EL1 in an UNKNOWN state. In practice this has been seen to result in kernel crashes after CPU_ON as a result of SCTLR_EL1.M being 1 in violation of the initial core configuration specified by PSCI. Fix this by initializing SCTLR_EL1 for cold entry to the host kernel. As it's necessary to write to SCTLR_EL12 in VHE mode, this initialization is moved into __kvm_host_psci_cpu_entry() where we can use write_sysreg_el1(). The remnants of the '__init_el2_nvhe_prepare_eret' macro are folded into its only caller, as this is clearer than having the macro. Fixes: cdf367192766ad11 ("KVM: arm64: Intercept host's CPU_ON SMCs") Reported-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Ahmed Genidi <ahmed.genidi@arm.com> [ Mark: clarify commit message, handle E2H, move to C, remove macro ] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ahmed Genidi <ahmed.genidi@arm.com> Cc: Ben Horgan <ben.horgan@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Leo Yan <leo.yan@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Reviewed-by: Leo Yan <leo.yan@arm.com> Link: https://lore.kernel.org/r/20250227180526.1204723-3-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-02KVM: arm64: Initialize HCR_EL2.E2H earlyMark Rutland
On CPUs without FEAT_E2H0, HCR_EL2.E2H is RES1, but may reset to an UNKNOWN value out of reset and consequently may not read as 1 unless it has been explicitly initialized. We handled this for the head.S boot code in commits: 3944382fa6f22b54 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative") b3320142f3db9b3f ("arm64: Fix early handling of FEAT_E2H0 not being implemented") Unfortunately, we forgot to apply a similar fix to the KVM PSCI entry points used when relaying CPU_ON, CPU_SUSPEND, and SYSTEM SUSPEND. When KVM is entered via these entry points, the value of HCR_EL2.E2H may be consumed before it has been initialized (e.g. by the 'init_el2_state' macro). Initialize HCR_EL2.E2H early in these paths such that it can be consumed reliably. The existing code in head.S is factored out into a new 'init_el2_hcr' macro, and this is used in the __kvm_hyp_init_cpu() function common to all the relevant PSCI entry points. For clarity, I've tweaked the assembly used to check whether ID_AA64MMFR4_EL1.E2H0 is negative. The bitfield is extracted as a signed value, and this is checked with a signed-greater-or-equal (GE) comparison. As the hyp code will reconfigure HCR_EL2 later in ___kvm_hyp_init(), all bits other than E2H are initialized to zero in __kvm_hyp_init_cpu(). Fixes: 3944382fa6f22b54 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative") Fixes: b3320142f3db9b3f ("arm64: Fix early handling of FEAT_E2H0 not being implemented") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ahmed Genidi <ahmed.genidi@arm.com> Cc: Ben Horgan <ben.horgan@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Leo Yan <leo.yan@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250227180526.1204723-2-mark.rutland@arm.com [maz: fixed LT->GE thinko] Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-01kvm: retry nx_huge_page_recovery_thread creationKeith Busch
A VMM may send a non-fatal signal to its threads, including vCPU tasks, at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU task receives the signal while its trying to spawn the huge page recovery vhost task, then KVM_RUN will fail due to copy_process() returning -ERESTARTNOINTR. Rework call_once() to mark the call complete if and only if the called function succeeds, and plumb the function's true error code back to the call_once() invoker. This provides userspace with the correct, non-fatal error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge page task. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [implemented the kvm user side] Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-3-kbusch@meta.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01vhost: return task creation error instead of NULLKeith Busch
Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQsSean Christopherson
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle debugctl bits from IRQ context, e.g. when enabling/disabling events via smp_call_function_single(). Taking the snapshot (long) before IRQs are disabled could result in KVM effectively clobbering DEBUGCTL due to using a stale snapshot. Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabledSean Christopherson
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's value on #VMEXIT) if it diverges from the host's value and LBR virtualization is disabled, as hardware only context switches DEBUGCTL if LBR virtualization is fully enabled. Running the guest with the host's value has likely been mildly problematic for quite some time, e.g. it will result in undesirable behavior if BTF diverges (with the caveat that KVM now suppresses guest BTF due to lack of support). But the bug became fatal with the introduction of Bus Lock Trap ("Detect" in kernel paralance) support for AMD (commit 408eb7417a92 ("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will trigger an unexpected #DB. Note, suppressing the bus lock #DB, i.e. simply resuming the guest without injecting a #DB, is not an option. It wouldn't address the general issue with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible side effects if BusLockTrap is left enabled. If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to clear it by software are ignored. But if BusLockTrap is enabled, software can clear DR6.BLD: Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2) to 1. When bus lock trap is enabled, ... The processor indicates that this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11). DR6[11] previously had been defined to be always 1. and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by other #DBs: All other #DB exceptions leave DR6[BLD] unmodified E.g. leaving BusLockTrap enable can confuse a legacy guest that writes '0' to reset DR6. Reported-by: rangemachine@gmail.com Reported-by: whanos@sergal.fun Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787 Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL in common x86Sean Christopherson
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Suppress DEBUGCTL.BTF on AMDSean Christopherson
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support BTF, and fully enabling BTF virtualization is non-trivial due to interactions with the emulator, guest_debug, #DB interception, nested SVM, etc. Don't inject #GP if the guest attempts to set BTF, as there's no way to communicate lack of support to the guest, and instead suppress the flag and treat the WRMSR as (partially) unsupported. In short, make KVM behave the same on AMD and Intel (VMX already squashes BTF). Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF has "worked" in any capacity is if the guest simultaneously enables LBRs. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective valueSean Christopherson
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed the architectural behavior of the bits and broke backwards compatibility. On CPUs without BusLockTrap (or at least, in APMs from before ~2023), bits 5:2 controlled the behavior of external pins: Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write. Software uses thesebits to control the type of information reported by the four external performance-monitoring/breakpoint pins on the processor. When a PBi bit is cleared to 0, the corresponding external pin (BPi) reports performance-monitor information. When a PBi bit is set to 1, the corresponding external pin (BPi) reports breakpoint information. With the introduction of BusLockTrap, presumably to be compatible with Intel CPUs, AMD redefined bit 2 to be BLCKDB: Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to enable generation of a #DB trap following successful execution of a bus lock when CPL is > 0. and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ". Ideally, KVM would treat bits 5:2 as reserved. Defer that change to a feature cleanup to avoid breaking existing guest in LTS kernels. For now, drop the bits to retain backwards compatibility (of a sort). Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits is visible to the guest, whereas now the guest will always see '0'. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: selftests: Assert that STI blocking isn't set after event injectionSean Christopherson
Add an L1 (guest) assert to the nested exceptions test to verify that KVM doesn't put VMRUN in an STI shadow (AMD CPUs bleed the shadow into the guest's int_state if a #VMEXIT occurs before VMRUN fully completes). Add a similar assert to the VMX side as well, because why not. Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadowSean Christopherson
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff() so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk (some would say "bug"), where the STI shadow bleeds into the guest's intr_state field if a #VMEXIT occurs during injection of an event, i.e. if the VMRUN doesn't complete before the subsequent #VMEXIT. The spurious "interrupts masked" state is relatively benign, as it only occurs during event injection and is transient. Because KVM is already injecting an event, the guest can't be in HLT, and if KVM is querying IRQ blocking for injection, then KVM would need to force an immediate exit anyways since injecting multiple events is impossible. However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the spurious STI shadow is visible to L1 when running a nested VM, which can trip sanity checks, e.g. in VMware's VMM. Hoist the STI+CLI all the way to C code, as the aforementioned calls to guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already possible. I.e. if there's kernel code that is confused by running with RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also blocks NMIs, the only change in exposure to non-KVM code (relative to surrounding VMRUN with STI+CLI) is exception handling code, and except for the kvm_rebooting=1 case, all exception in the core VM-Enter/VM-Exit path are fatal. Use the "raw" variants to enable/disable IRQs to avoid tracing in the "no instrumentation" code; the guest state helpers also take care of tracing IRQ state. Oppurtunstically document why KVM needs to do STI in the first place. Reported-by: Doug Covelli <doug.covelli@broadcom.com> Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com Fixes: f14eec0a3203 ("KVM: SVM: move more vmentry code to assembly") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-26KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pendingSean Christopherson
Process pending events on nested VM-Exit if the vCPU has an injectable IRQ or NMI, as the event may have become pending while L2 was active, i.e. may not be tracked in the context of vmcs01. E.g. if L1 has passed its APIC through to L2 and an IRQ arrives while L2 is active, then KVM needs to request an IRQ window prior to running L1, otherwise delivery of the IRQ will be delayed until KVM happens to process events for some other reason. The missed failure is detected by vmx_apic_passthrough_tpr_threshold_test in KVM-Unit-Tests, but has effectively been masked due to a flaw in KVM's PIC emulation that causes KVM to make spurious KVM_REQ_EVENT requests (and apparently no one ever ran the test with split IRQ chips). Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Free vCPUs before freeing VM stateSean Christopherson
Free vCPUs before freeing any VM state, as both SVM and VMX may access VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs to be kicked out of nested guest mode. Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was called") partially fixed the issue, but for unknown reasons only moved the MMU unloading before VM destruction. Complete the change, and free all vCPU state prior to destroying VM state, as nVMX accesses even more state than nSVM. In addition to the AVIC, KVM can hit a use-after-free on MSR filters: kvm_msr_allowed+0x4c/0xd0 __kvm_set_msr+0x12d/0x1e0 kvm_set_msr+0x19/0x40 load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel] nested_vmx_vmexit+0x715/0xbd0 [kvm_intel] nested_vmx_free_vcpu+0x33/0x50 [kvm_intel] vmx_free_vcpu+0x54/0xc0 [kvm_intel] kvm_arch_vcpu_destroy+0x28/0xf0 kvm_vcpu_destroy+0x12/0x50 kvm_arch_destroy_vm+0x12c/0x1c0 kvm_put_kvm+0x263/0x3c0 kvm_vm_release+0x21/0x30 and an upcoming fix to process injectable interrupts on nested VM-Exit will access the PIC: BUG: kernel NULL pointer dereference, address: 0000000000000090 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm] Call Trace: <TASK> kvm_cpu_has_injectable_intr+0xe/0x60 [kvm] nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel] nested_vmx_free_vcpu+0x40/0x50 [kvm_intel] vmx_vcpu_free+0x2d/0x80 [kvm_intel] kvm_arch_vcpu_destroy+0x2d/0x130 [kvm] kvm_destroy_vcpus+0x8a/0x100 [kvm] kvm_arch_destroy_vm+0xa7/0x1d0 [kvm] kvm_destroy_vm+0x172/0x300 [kvm] kvm_vcpu_release+0x31/0x50 [kvm] Inarguably, both nSVM and nVMX need to be fixed, but punt on those cleanups for the moment. Conceptually, vCPUs should be freed before VM state. Assets like the I/O APIC and PIC _must_ be allocated before vCPUs are created, so it stands to reason that they must be freed _after_ vCPUs are destroyed. Reported-by: Aaron Lewis <aaronlewis@google.com> Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com Cc: Jim Mattson <jmattson@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-24Merge tag 'kvm-riscv-fixes-6.14-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini
into HEAD KVM/riscv fixes for 6.14, take #1 - Fix hart status check in SBI HSM extension - Fix hart suspend_type usage in SBI HSM extension - Fix error returned by SBI IPI and TIME extensions for unsupported function IDs - Fix suspend_type usage in SBI SUSP extension - Remove unnecessary vcpu kick after injecting interrupt via IMSIC guest file
2025-02-24Merge tag 'kvmarm-fixes-6.14-3' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #3 - Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout. - Bring back the VMID allocation to the vcpu_load phase, ensuring that we only setup VTTBR_EL2 once on VHE. This cures an ugly race that would lead to running with an unallocated VMID.
2025-02-21riscv: KVM: Remove unnecessary vcpu kickBillXiang
Remove the unnecessary kick to the vCPU after writing to the vs_file of IMSIC in kvm_riscv_vcpu_aia_imsic_inject. For vCPUs that are running, writing to the vs_file directly forwards the interrupt as an MSI to them and does not need an extra kick. For vCPUs that are descheduled after emulating WFI, KVM will enable the guest external interrupt for that vCPU in kvm_riscv_aia_wakeon_hgei. This means that writing to the vs_file will cause a guest external interrupt, which will cause KVM to wake up the vCPU in hgei_interrupt to handle the interrupt properly. Signed-off-by: BillXiang <xiangwencheng@lanxincomputing.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Radim Krčmář <rkrcmar@ventanamicro.com> Link: https://lore.kernel.org/r/20250221104538.2147-1-xiangwencheng@lanxincomputing.com Signed-off-by: Anup Patel <anup@brainfault.org>
2025-02-20KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2Oliver Upton
Vladimir reports that a race condition to attach a VMID to a stage-2 MMU sometimes results in a vCPU entering the guest with a VMID of 0: | CPU1 | CPU2 | | | | kvm_arch_vcpu_ioctl_run | | vcpu_load <= load VTTBR_EL2 | | kvm_vmid->id = 0 | | | kvm_arch_vcpu_ioctl_run | | vcpu_load <= load VTTBR_EL2 | | with kvm_vmid->id = 0| | kvm_arm_vmid_update <= allocates fresh | | kvm_vmid->id and | | reload VTTBR_EL2 | | | | | kvm_arm_vmid_update <= observes that kvm_vmid->id | | already allocated, | | skips reload VTTBR_EL2 Oh yeah, it's as bad as it looks. Remember that VHE loads the stage-2 MMU eagerly but a VMID only gets attached to the MMU later on in the KVM_RUN loop. Even in the "best case" where VTTBR_EL2 correctly gets reprogrammed before entering the EL1&0 regime, there is a period of time where hardware is configured with VMID 0. That's completely insane. So, rather than decorating the 'late' binding with another hack, just allocate the damn thing up front. Attaching a VMID from vcpu_load() is still rollover safe since (surprise!) it'll always get called after a vCPU was preempted. Excuse me while I go find a brown paper bag. Cc: stable@vger.kernel.org Fixes: 934bf871f011 ("KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()") Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250219220737.130842-1-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-19KVM: arm64: Fix tcr_el2 initialisation in hVHE modeWill Deacon
When not running in VHE mode, cpu_prepare_hyp_mode() computes the value of TCR_EL2 using the host's TCR_EL1 settings as a starting point. For nVHE, this amounts to masking out everything apart from the TG0, SH0, ORGN0, IRGN0 and T0SZ fields before setting the RES1 bits, shifting the IPS field down to the PS field and setting DS if LPA2 is enabled. Unfortunately, for hVHE, things go slightly wonky: EPD1 is correctly set to disable walks via TTBR1_EL2 but then the T1SZ and IPS fields are corrupted when we mistakenly attempt to initialise the PS and DS fields in their E2H=0 positions. Furthermore, many fields are retained from TCR_EL1 which should not be propagated to TCR_EL2. Notably, this means we can end up with A1 set despite not initialising TTBR1_EL2 at all. This has been shown to cause unexpected translation faults at EL2 with pKVM due to TLB invalidation not taking effect when running with a non-zero ASID. Fix the TCR_EL2 initialisation code to set PS and DS only when E2H=0, masking out HD, HA and A1 when E2H=1. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Fixes: ad744e8cb346 ("arm64: Allow arm64_sw.hvhe on command line") Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250214133724.13179-1-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-17riscv: KVM: Fix SBI sleep_type useAndrew Jones
The spec says sleep_type is 32 bits wide and "In case the data is defined as 32bit wide, higher privilege software must ensure that it only uses 32 bit data." Mask off upper bits of sleep_type before using it. Fixes: 023c15151fbb ("RISC-V: KVM: Add SBI system suspend support") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20250217084506.18763-12-ajones@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2025-02-17riscv: KVM: Fix SBI TIME error generationAndrew Jones
When an invalid function ID of an SBI extension is used we should return not-supported, not invalid-param. Fixes: 5f862df5585c ("RISC-V: KVM: Add v0.1 replacement SBI extensions defined in v0.2") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20250217084506.18763-11-ajones@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2025-02-17riscv: KVM: Fix SBI IPI error generationAndrew Jones
When an invalid function ID of an SBI extension is used we should return not-supported, not invalid-param. Also, when we see that at least one hartid constructed from the base and mask parameters is invalid, then we should return invalid-param. Finally, rather than relying on overflowing a left shift to result in zero and then using that zero in a condition which [correctly] skips sending an IPI (but loops unnecessarily), explicitly check for overflow and exit the loop immediately. Fixes: 5f862df5585c ("RISC-V: KVM: Add v0.1 replacement SBI extensions defined in v0.2") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20250217084506.18763-10-ajones@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2025-02-17riscv: KVM: Fix hart suspend_type useAndrew Jones
The spec says suspend_type is 32 bits wide and "In case the data is defined as 32bit wide, higher privilege software must ensure that it only uses 32 bit data." Mask off upper bits of suspend_type before using it. Fixes: 763c8bed8c05 ("RISC-V: KVM: Implement SBI HSM suspend call") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20250217084506.18763-9-ajones@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2025-02-17riscv: KVM: Fix hart suspend status checkAndrew Jones
"Not stopped" means started or suspended so we need to check for a single state in order to have a chance to check for each state. Also, we need to use target_vcpu when checking for the suspend state. Fixes: 763c8bed8c05 ("RISC-V: KVM: Implement SBI HSM suspend call") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20250217084506.18763-8-ajones@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2025-02-16Linux 6.14-rc3Linus Torvalds
2025-02-16Merge tag 'kbuild-fixes-v6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Fix annoying logs when building tools in parallel - Fix the Debian linux-headers package build again - Fix the target triple detection for userspace programs on Clang * tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: modpost: Fix a few typos in a comment kbuild: userprogs: fix bitsize and target detection on clang kbuild: fix linux-headers package build when $(CC) cannot link userspace tools: fix annoying "mkdir -p ..." logs when building tools in parallel
2025-02-16Merge tag 'driver-core-6.14-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core api addition from Greg KH: "Here is a driver core new api for 6.14-rc3 that is being added to allow platform devices from stop being abused. It adds a new 'faux_device' structure and bus and api to allow almost a straight or simpler conversion from platform devices that were not really a platform device. It also comes with a binding for rust, with an example driver in rust showing how it's used. I'm adding this now so that the patches that convert the different drivers and subsystems can all start flowing into linux-next now through their different development trees, in time for 6.15-rc1. We have a number that are already reviewed and tested, but adding those conversions now doesn't seem right. For now, no one is using this, and it passes all build tests from 0-day and linux-next, so all should be good" * tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: rust/kernel: Add faux device bindings driver core: add a faux bus for use when a simple device/bus is needed