summaryrefslogtreecommitdiff
path: root/Documentation/virt/kvm/api.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virt/kvm/api.rst')
-rw-r--r--Documentation/virt/kvm/api.rst184
1 files changed, 175 insertions, 9 deletions
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 47c7c3f92314..6aa40ee05a4a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1411,6 +1411,9 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).
+For TDX guest, deleting/moving memory region loses guest memory contents.
+Read only region isn't supported. Only as-id 0 is supported.
+
Note: On arm64, a write generated by the page-table walker (to update
the Access and Dirty flags, for example) never results in a
KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
@@ -2003,7 +2006,14 @@ frequency is KHz.
If the KVM_CAP_VM_TSC_CONTROL capability is advertised, this can also
be used as a vm ioctl to set the initial tsc frequency of subsequently
-created vCPUs.
+created vCPUs. Note, the vm ioctl is only allowed prior to creating vCPUs.
+
+For TSC protected Confidential Computing (CoCo) VMs where TSC frequency
+is configured once at VM scope and remains unchanged during VM's
+lifetime, the vm ioctl should be used to configure the TSC frequency
+and the vcpu ioctl is not supported.
+
+Example of such CoCo VMs: TDX guests.
4.56 KVM_GET_TSC_KHZ
--------------------
@@ -3460,7 +3470,8 @@ The initial values are defined as:
- FPSIMD/NEON registers: set to 0
- SVE registers: set to 0
- System registers: Reset to their architecturally defined
- values as for a warm reset to EL1 (resp. SVC)
+ values as for a warm reset to EL1 (resp. SVC) or EL2 (in the
+ case of EL2 being enabled).
Note that because some registers reflect machine topology, all vcpus
should be created before this ioctl is invoked.
@@ -3527,6 +3538,17 @@ Possible features:
- the KVM_REG_ARM64_SVE_VLS pseudo-register is immutable, and can
no longer be written using KVM_SET_ONE_REG.
+ - KVM_ARM_VCPU_HAS_EL2: Enable Nested Virtualisation support,
+ booting the guest from EL2 instead of EL1.
+ Depends on KVM_CAP_ARM_EL2.
+ The VM is running with HCR_EL2.E2H being RES1 (VHE) unless
+ KVM_ARM_VCPU_HAS_EL2_E2H0 is also set.
+
+ - KVM_ARM_VCPU_HAS_EL2_E2H0: Restrict Nested Virtualisation
+ support to HCR_EL2.E2H being RES0 (non-VHE).
+ Depends on KVM_CAP_ARM_EL2_E2H0.
+ KVM_ARM_VCPU_HAS_EL2 must also be set.
+
4.83 KVM_ARM_PREFERRED_TARGET
-----------------------------
@@ -4768,7 +4790,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
:Capability: basic
:Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error
@@ -4776,9 +4798,11 @@ If the platform supports creating encrypted VMs then this ioctl can be used
for issuing platform-specific memory encryption commands to manage those
encrypted VMs.
-Currently, this ioctl is used for issuing Secure Encrypted Virtualization
-(SEV) commands on AMD Processors. The SEV commands are defined in
-Documentation/virt/kvm/x86/amd-memory-encryption.rst.
+Currently, this ioctl is used for issuing both Secure Encrypted Virtualization
+(SEV) commands on AMD Processors and Trusted Domain Extensions (TDX) commands
+on Intel Processors. The detailed commands are defined in
+Documentation/virt/kvm/x86/amd-memory-encryption.rst and
+Documentation/virt/kvm/x86/intel-tdx.rst.
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------
@@ -6628,7 +6652,8 @@ to the byte array.
.. note::
For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
- KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
+ KVM_EXIT_EPR, KVM_EXIT_HYPERCALL, KVM_EXIT_TDX,
+ KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
incomplete operations and then check for pending signals.
@@ -6827,6 +6852,7 @@ should put the acknowledged interrupt vector into the 'epr' field.
#define KVM_SYSTEM_EVENT_WAKEUP 4
#define KVM_SYSTEM_EVENT_SUSPEND 5
#define KVM_SYSTEM_EVENT_SEV_TERM 6
+ #define KVM_SYSTEM_EVENT_TDX_FATAL 7
__u32 type;
__u32 ndata;
__u64 data[16];
@@ -6853,6 +6879,11 @@ Valid values for 'type' are:
reset/shutdown of the VM.
- KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination.
The guest physical address of the guest's GHCB is stored in `data[0]`.
+ - KVM_SYSTEM_EVENT_TDX_FATAL -- a TDX guest reported a fatal error state.
+ KVM doesn't do any parsing or conversion, it just dumps 16 general-purpose
+ registers to userspace, in ascending order of the 4-bit indices for x86-64
+ general-purpose registers in instruction encoding, as defined in the Intel
+ SDM.
- KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and
KVM has recognized a wakeup event. Userspace may honor this event by
marking the exiting vCPU as runnable, or deny it and call KVM_RUN again.
@@ -7153,6 +7184,69 @@ The valid value for 'flags' is:
::
+ /* KVM_EXIT_TDX */
+ struct {
+ __u64 flags;
+ __u64 nr;
+ union {
+ struct {
+ u64 ret;
+ u64 data[5];
+ } unknown;
+ struct {
+ u64 ret;
+ u64 gpa;
+ u64 size;
+ } get_quote;
+ struct {
+ u64 ret;
+ u64 leaf;
+ u64 r11, r12, r13, r14;
+ } get_tdvmcall_info;
+ struct {
+ u64 ret;
+ u64 vector;
+ } setup_event_notify;
+ };
+ } tdx;
+
+Process a TDVMCALL from the guest. KVM forwards select TDVMCALL based
+on the Guest-Hypervisor Communication Interface (GHCI) specification;
+KVM bridges these requests to the userspace VMM with minimal changes,
+placing the inputs in the union and copying them back to the guest
+on re-entry.
+
+Flags are currently always zero, whereas ``nr`` contains the TDVMCALL
+number from register R11. The remaining field of the union provide the
+inputs and outputs of the TDVMCALL. Currently the following values of
+``nr`` are defined:
+
+ * ``TDVMCALL_GET_QUOTE``: the guest has requested to generate a TD-Quote
+ signed by a service hosting TD-Quoting Enclave operating on the host.
+ Parameters and return value are in the ``get_quote`` field of the union.
+ The ``gpa`` field and ``size`` specify the guest physical address
+ (without the shared bit set) and the size of a shared-memory buffer, in
+ which the TDX guest passes a TD Report. The ``ret`` field represents
+ the return value of the GetQuote request. When the request has been
+ queued successfully, the TDX guest can poll the status field in the
+ shared-memory area to check whether the Quote generation is completed or
+ not. When completed, the generated Quote is returned via the same buffer.
+
+ * ``TDVMCALL_GET_TD_VM_CALL_INFO``: the guest has requested the support
+ status of TDVMCALLs. The output values for the given leaf should be
+ placed in fields from ``r11`` to ``r14`` of the ``get_tdvmcall_info``
+ field of the union.
+
+ * ``TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT``: the guest has requested to
+ set up a notification interrupt for vector ``vector``.
+
+KVM may add support for more values in the future that may cause a userspace
+exit, even without calls to ``KVM_ENABLE_CAP`` or similar. In this case,
+it will enter with output fields already valid; in the common case, the
+``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``.
+Userspace need not do anything if it does not wish to support a TDVMCALL.
+::
+
/* Fix the size of the union. */
char padding[256];
};
@@ -7757,6 +7851,7 @@ Valid bits in args[0] are::
#define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
#define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2)
#define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3)
+ #define KVM_X86_DISABLE_EXITS_APERFMPERF (1 << 4)
Enabling this capability on a VM provides userspace with a way to no
longer intercept some instructions for improved latency in some
@@ -7767,6 +7862,28 @@ all such vmexits.
Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
+Virtualizing the ``IA32_APERF`` and ``IA32_MPERF`` MSRs requires more
+than just disabling APERF/MPERF exits. While both Intel and AMD
+document strict usage conditions for these MSRs--emphasizing that only
+the ratio of their deltas over a time interval (T0 to T1) is
+architecturally defined--simply passing through the MSRs can still
+produce an incorrect ratio.
+
+This erroneous ratio can occur if, between T0 and T1:
+
+1. The vCPU thread migrates between logical processors.
+2. Live migration or suspend/resume operations take place.
+3. Another task shares the vCPU's logical processor.
+4. C-states lower than C0 are emulated (e.g., via HLT interception).
+5. The guest TSC frequency doesn't match the host TSC frequency.
+
+Due to these complexities, KVM does not automatically associate this
+passthrough capability with the guest CPUID bit,
+``CPUID.6:ECX.APERFMPERF[bit 0]``. Userspace VMMs that deem this
+mechanism adequate for virtualizing the ``IA32_APERF`` and
+``IA32_MPERF`` MSRs must set the guest CPUID bit explicitly.
+
+
7.14 KVM_CAP_S390_HPAGE_1M
--------------------------
@@ -7978,6 +8095,11 @@ apply some other policy-based mitigation. When exiting to userspace, KVM sets
KVM_RUN_X86_BUS_LOCK in vcpu-run->flags, and conditionally sets the exit_reason
to KVM_EXIT_X86_BUS_LOCK.
+Due to differences in the underlying hardware implementation, the vCPU's RIP at
+the time of exit diverges between Intel and AMD. On Intel hosts, RIP points at
+the next instruction, i.e. the exit is trap-like. On AMD hosts, RIP points at
+the offending instruction, i.e. the exit is fault-like.
+
Note! Detected bus locks may be coincident with other exits to userspace, i.e.
KVM_RUN_X86_BUS_LOCK should be checked regardless of the primary exit reason if
userspace wants to take action on all detected bus locks.
@@ -8194,6 +8316,28 @@ KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the
and 0x489), as KVM does now allow them to
be set by userspace (KVM sets them based on
guest CPUID, for safety purposes).
+
+KVM_X86_QUIRK_IGNORE_GUEST_PAT By default, on Intel platforms, KVM ignores
+ guest PAT and forces the effective memory
+ type to WB in EPT. The quirk is not available
+ on Intel platforms which are incapable of
+ safely honoring guest PAT (i.e., without CPU
+ self-snoop, KVM always ignores guest PAT and
+ forces effective memory type to WB). It is
+ also ignored on AMD platforms or, on Intel,
+ when a VM has non-coherent DMA devices
+ assigned; KVM always honors guest PAT in
+ such case. The quirk is needed to avoid
+ slowdowns on certain Intel Xeon platforms
+ (e.g. ICX, SPR) where self-snoop feature is
+ supported but UC is slow enough to cause
+ issues with some older guests that use
+ UC instead of WC to map the video RAM.
+ Userspace can disable the quirk to honor
+ guest PAT if it knows that there is no such
+ guest software, for example if it does not
+ expose a bochs graphics device (which is
+ known to have had a buggy driver).
=================================== ============================================
7.32 KVM_CAP_MAX_VCPU_ID
@@ -8266,7 +8410,7 @@ core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.
7.36 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
----------------------------------------------------------
-:Architectures: x86, arm64
+:Architectures: x86, arm64, riscv
:Type: vm
:Parameters: args[0] - size of the dirty log ring
@@ -8478,7 +8622,7 @@ ENOSYS for the others.
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
-------------------------------------
:Architectures: arm64
@@ -8496,6 +8640,28 @@ aforementioned registers before the first KVM_RUN. These registers are VM
scoped, meaning that the same set of values are presented on all vCPUs in a
given VM.
+7.43 KVM_CAP_RISCV_MP_STATE_RESET
+---------------------------------
+
+:Architectures: riscv
+:Type: VM
+:Parameters: None
+:Returns: 0 on success, -EINVAL if arg[0] is not zero
+
+When this capability is enabled, KVM resets the VCPU when setting
+MP_STATE_INIT_RECEIVED through IOCTL. The original MP_STATE is preserved.
+
+7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+-------------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: None
+
+This capability indicate to the userspace whether a PFNMAP memory region
+can be safely mapped as cacheable. This relies on the presence of
+force write back (FWB) feature support on the hardware.
+
8. Other capabilities.
======================