Diffstat (limited to 'Documentation/virtual')
-rw-r--r-- | Documentation/virtual/kvm/00-INDEX | 7
-rw-r--r-- | Documentation/virtual/kvm/amd-memory-encryption.rst | 247
-rw-r--r-- | Documentation/virtual/kvm/api.txt | 227
-rw-r--r-- | Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 187
-rw-r--r-- | Documentation/virtual/kvm/cpuid.txt | 23
-rw-r--r-- | Documentation/virtual/kvm/msr.txt | 3
6 files changed, 481 insertions, 213 deletions
diff --git a/Documentation/virtual/kvm/00-INDEX b/Documentation/virtual/kvm/00-INDEX index 69fe1a8b7ad1..3492458a4ae8 100644 --- a/Documentation/virtual/kvm/00-INDEX +++ b/Documentation/virtual/kvm/00-INDEX @@ -1,7 +1,12 @@ 00-INDEX - this file. +amd-memory-encryption.rst + - notes on AMD Secure Encrypted Virtualization feature and SEV firmware + command description api.txt - KVM userspace API. +arm + - internal ABI between the kernel and HYP (for arm/arm64) cpuid.txt - KVM-specific cpuid leaves (x86). devices/ @@ -26,3 +31,5 @@ s390-diag.txt - Diagnose hypercall description (for IBM S/390) timekeeping.txt - timekeeping virtualization for x86-based architectures. +vcpu-requests.rst + - internal VCPU request API diff --git a/Documentation/virtual/kvm/amd-memory-encryption.rst b/Documentation/virtual/kvm/amd-memory-encryption.rst new file mode 100644 index 000000000000..71d6d257074f --- /dev/null +++ b/Documentation/virtual/kvm/amd-memory-encryption.rst @@ -0,0 +1,247 @@ +====================================== +Secure Encrypted Virtualization (SEV) +====================================== + +Overview +======== + +Secure Encrypted Virtualization (SEV) is a feature found on AMD processors. + +SEV is an extension to the AMD-V architecture which supports running +virtual machines (VMs) under the control of a hypervisor. When enabled, +the memory contents of a VM will be transparently encrypted with a key +unique to that VM. + +The hypervisor can determine the SEV support through the CPUID +instruction. The CPUID function 0x8000001f reports information related +to SEV:: + + 0x8000001f[eax]: + Bit[1] indicates support for SEV + ... 
+ [ecx]: + Bits[31:0] Number of encrypted guests supported simultaneously + +If support for SEV is present, MSR 0xc001_0010 (MSR_K8_SYSCFG) and MSR 0xc001_0015 +(MSR_K7_HWCR) can be used to determine if it can be enabled:: + + 0xc001_0010: + Bit[23] 1 = memory encryption can be enabled + 0 = memory encryption can not be enabled + + 0xc001_0015: + Bit[0] 1 = memory encryption can be enabled + 0 = memory encryption can not be enabled + +When SEV support is available, it can be enabled in a specific VM by +setting the SEV bit before executing VMRUN.:: + + VMCB[0x90]: + Bit[1] 1 = SEV is enabled + 0 = SEV is disabled + +SEV hardware uses ASIDs to associate a memory encryption key with a VM. +Hence, the ASID for the SEV-enabled guests must be from 1 to a maximum value +defined in the CPUID 0x8000001f[ecx] field. + +SEV Key Management +================== + +The SEV guest key management is handled by a separate processor called the AMD +Secure Processor (AMD-SP). Firmware running inside the AMD-SP provides a secure +key management interface to perform common hypervisor activities such as +encrypting bootstrap code, snapshot, migrating and debugging the guest. For more +information, see the SEV Key Management spec [api-spec]_ + +KVM implements the following commands to support common lifecycle events of SEV +guests, such as launching, running, snapshotting, migrating and decommissioning. + +1. KVM_SEV_INIT +--------------- + +The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform +context. In a typical workflow, this command should be the first command issued. + +Returns: 0 on success, -negative on error + +2. KVM_SEV_LAUNCH_START +----------------------- + +The KVM_SEV_LAUNCH_START command is used for creating the memory encryption +context. To create the encryption context, user must provide a guest policy, +the owner's public Diffie-Hellman (PDH) key and session information. 
+ +Parameters: struct kvm_sev_launch_start (in/out) + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_start { + __u32 handle; /* if zero then firmware creates a new handle */ + __u32 policy; /* guest's policy */ + + __u64 dh_uaddr; /* userspace address pointing to the guest owner's PDH key */ + __u32 dh_len; + + __u64 session_addr; /* userspace address which points to the guest session information */ + __u32 session_len; + }; + +On success, the 'handle' field contains a new handle and on error, a negative value. + +For more details, see SEV spec Section 6.2. + +3. KVM_SEV_LAUNCH_UPDATE_DATA +----------------------------- + +The KVM_SEV_LAUNCH_UPDATE_DATA is used for encrypting a memory region. It also +calculates a measurement of the memory contents. The measurement is a signature +of the memory contents that can be sent to the guest owner as an attestation +that the memory was encrypted correctly by the firmware. + +Parameters (in): struct kvm_sev_launch_update_data + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_update { + __u64 uaddr; /* userspace address to be encrypted (must be 16-byte aligned) */ + __u32 len; /* length of the data to be encrypted (must be 16-byte aligned) */ + }; + +For more details, see SEV spec Section 6.3. + +4. KVM_SEV_LAUNCH_MEASURE +------------------------- + +The KVM_SEV_LAUNCH_MEASURE command is used to retrieve the measurement of the +data encrypted by the KVM_SEV_LAUNCH_UPDATE_DATA command. The guest owner may +wait to provide the guest with confidential information until it can verify the +measurement. Since the guest owner knows the initial contents of the guest at +boot, the measurement can be verified by comparing it to what the guest owner +expects. 
+ +Parameters (in): struct kvm_sev_launch_measure + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_measure { + __u64 uaddr; /* where to copy the measurement */ + __u32 len; /* length of measurement blob */ + }; + +For more details on the measurement verification flow, see SEV spec Section 6.4. + +5. KVM_SEV_LAUNCH_FINISH +------------------------ + +After completion of the launch flow, the KVM_SEV_LAUNCH_FINISH command can be +issued to make the guest ready for the execution. + +Returns: 0 on success, -negative on error + +6. KVM_SEV_GUEST_STATUS +----------------------- + +The KVM_SEV_GUEST_STATUS command is used to retrieve status information about a +SEV-enabled guest. + +Parameters (out): struct kvm_sev_guest_status + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_guest_status { + __u32 handle; /* guest handle */ + __u32 policy; /* guest policy */ + __u8 state; /* guest state (see enum below) */ + }; + +SEV guest state: + +:: + + enum { + SEV_STATE_INVALID = 0; + SEV_STATE_LAUNCHING, /* guest is currently being launched */ + SEV_STATE_SECRET, /* guest is being launched and ready to accept the ciphertext data */ + SEV_STATE_RUNNING, /* guest is fully launched and running */ + SEV_STATE_RECEIVING, /* guest is being migrated in from another SEV machine */ + SEV_STATE_SENDING /* guest is getting migrated out to another SEV machine */ + }; + +7. KVM_SEV_DBG_DECRYPT +---------------------- + +The KVM_SEV_DEBUG_DECRYPT command can be used by the hypervisor to request the +firmware to decrypt the data at the given memory region. + +Parameters (in): struct kvm_sev_dbg + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_dbg { + __u64 src_uaddr; /* userspace address of data to decrypt */ + __u64 dst_uaddr; /* userspace address of destination */ + __u32 len; /* length of memory region to decrypt */ + }; + +The command returns an error if the guest policy does not allow debugging. + +8. 
KVM_SEV_DBG_ENCRYPT +---------------------- + +The KVM_SEV_DEBUG_ENCRYPT command can be used by the hypervisor to request the +firmware to encrypt the data at the given memory region. + +Parameters (in): struct kvm_sev_dbg + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_dbg { + __u64 src_uaddr; /* userspace address of data to encrypt */ + __u64 dst_uaddr; /* userspace address of destination */ + __u32 len; /* length of memory region to encrypt */ + }; + +The command returns an error if the guest policy does not allow debugging. + +9. KVM_SEV_LAUNCH_SECRET +------------------------ + +The KVM_SEV_LAUNCH_SECRET command can be used by the hypervisor to inject secret +data after the measurement has been validated by the guest owner. + +Parameters (in): struct kvm_sev_launch_secret + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_secret { + __u64 hdr_uaddr; /* userspace address containing the packet header */ + __u32 hdr_len; + + __u64 guest_uaddr; /* the guest memory region where the secret should be injected */ + __u32 guest_len; + + __u64 trans_uaddr; /* the hypervisor memory region which contains the secret */ + __u32 trans_len; + }; + +References +========== + +.. [white-paper] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf +.. [api-spec] http://support.amd.com/TechDocs/55766_SEV-KM%20API_Specification.pdf +.. [amd-apm] http://support.amd.com/TechDocs/24593.pdf (section 15.34) +.. [kvm-forum] http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index fc3ae951bc07..1c7958b57fe9 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -123,14 +123,15 @@ memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the flag KVM_VM_MIPS_VZ. 
-4.3 KVM_GET_MSR_INDEX_LIST +4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST -Capability: basic +Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST Architectures: x86 -Type: system +Type: system ioctl Parameters: struct kvm_msr_list (in/out) Returns: 0 on success; -1 on error Errors: + EFAULT: the msr index list cannot be read from or written to E2BIG: the msr index list is to be to fit in the array specified by the user. @@ -139,16 +140,23 @@ struct kvm_msr_list { __u32 indices[0]; }; -This ioctl returns the guest msrs that are supported. The list varies -by kvm version and host processor, but does not change otherwise. The -user fills in the size of the indices array in nmsrs, and in return -kvm adjusts nmsrs to reflect the actual number of msrs and fills in -the indices array with their numbers. +The user fills in the size of the indices array in nmsrs, and in return +kvm adjusts nmsrs to reflect the actual number of msrs and fills in the +indices array with their numbers. + +KVM_GET_MSR_INDEX_LIST returns the guest msrs that are supported. The list +varies by kvm version and host processor, but does not change otherwise. Note: if kvm indicates supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are not returned in the MSR list, as different vcpus can have a different number of banks, as set via the KVM_X86_SETUP_MCE ioctl. +KVM_GET_MSR_FEATURE_INDEX_LIST returns the list of MSRs that can be passed +to the KVM_GET_MSRS system ioctl. This lets userspace probe host capabilities +and processor features that are exposed via MSRs (e.g., VMX capabilities). +This list also varies by kvm version and host processor, but does not change +otherwise. + 4.4 KVM_CHECK_EXTENSION @@ -475,14 +483,22 @@ Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. 
4.18 KVM_GET_MSRS -Capability: basic +Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system) Architectures: x86 -Type: vcpu ioctl +Type: system ioctl, vcpu ioctl Parameters: struct kvm_msrs (in/out) -Returns: 0 on success, -1 on error +Returns: number of msrs successfully returned; + -1 on error +When used as a system ioctl: +Reads the values of MSR-based features that are available for the VM. This +is similar to KVM_GET_SUPPORTED_CPUID, but it returns MSR indices and values. +The list of msr-based features can be obtained using KVM_GET_MSR_FEATURE_INDEX_LIST +in a system ioctl. + +When used as a vcpu ioctl: Reads model-specific registers from the vcpu. Supported msr indices can -be obtained using KVM_GET_MSR_INDEX_LIST. +be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl. struct kvm_msrs { __u32 nmsrs; /* number of msrs in entries */ @@ -1841,6 +1857,7 @@ registers, find a list below: PPC | KVM_REG_PPC_DBSR | 32 PPC | KVM_REG_PPC_TIDR | 64 PPC | KVM_REG_PPC_PSSCR | 64 + PPC | KVM_REG_PPC_DEC_EXPIRY | 64 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... PPC | KVM_REG_PPC_TM_GPR31 | 64 @@ -3403,7 +3420,7 @@ invalid, if invalid pages are written to (e.g. after the end of memory) or if no page table is present for the addresses (e.g. when using hugepages). -4.108 KVM_PPC_GET_CPU_CHAR +4.109 KVM_PPC_GET_CPU_CHAR Capability: KVM_CAP_PPC_GET_CPU_CHAR Architectures: powerpc @@ -3449,6 +3466,89 @@ array bounds check and the array access. These fields use the same bit definitions as the new H_GET_CPU_CHARACTERISTICS hypercall. +4.110 KVM_MEMORY_ENCRYPT_OP + +Capability: basic +Architectures: x86 +Type: system +Parameters: an opaque platform specific structure (in/out) +Returns: 0 on success; -1 on error + +If the platform supports creating encrypted VMs then this ioctl can be used +for issuing platform-specific memory encryption commands to manage those +encrypted VMs. 
+ +Currently, this ioctl is used for issuing Secure Encrypted Virtualization +(SEV) commands on AMD Processors. The SEV commands are defined in +Documentation/virtual/kvm/amd-memory-encryption.rst. + +4.111 KVM_MEMORY_ENCRYPT_REG_REGION + +Capability: basic +Architectures: x86 +Type: system +Parameters: struct kvm_enc_region (in) +Returns: 0 on success; -1 on error + +This ioctl can be used to register a guest memory region which may +contain encrypted data (e.g. guest RAM, SMRAM etc). + +It is used in the SEV-enabled guest. When encryption is enabled, a guest +memory region may contain encrypted data. The SEV memory encryption +engine uses a tweak such that two identical plaintext pages, each at +different locations will have differing ciphertexts. So swapping or +moving ciphertext of those pages will not result in plaintext being +swapped. So relocating (or migrating) physical backing pages for the SEV +guest will require some additional steps. + +Note: The current SEV key management spec does not provide commands to +swap or migrate (move) ciphertext pages. Hence, for now we pin the guest +memory region registered with the ioctl. + +4.112 KVM_MEMORY_ENCRYPT_UNREG_REGION + +Capability: basic +Architectures: x86 +Type: system +Parameters: struct kvm_enc_region (in) +Returns: 0 on success; -1 on error + +This ioctl can be used to unregister the guest memory region registered +with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above. + +4.113 KVM_HYPERV_EVENTFD + +Capability: KVM_CAP_HYPERV_EVENTFD +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_hyperv_eventfd (in) + +This ioctl (un)registers an eventfd to receive notifications from the guest on +the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without +causing a user exit. SIGNAL_EVENT hypercall with non-zero event flag number +(bits 24-31) still triggers a KVM_EXIT_HYPERV_HCALL user exit. 
+ +struct kvm_hyperv_eventfd { + __u32 conn_id; + __s32 fd; + __u32 flags; + __u32 padding[3]; +}; + +The conn_id field should fit within 24 bits: + +#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff + +The acceptable values for the flags field are: + +#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) + +Returns: 0 on success, + -EINVAL if conn_id or flags is outside the allowed range + -ENOENT on deassign if the conn_id isn't registered + -EEXIST on assign if the conn_id is already registered + + 5. The kvm_run structure ------------------------ @@ -3805,7 +3905,7 @@ in userspace. __u64 kvm_dirty_regs; union { struct kvm_sync_regs regs; - char padding[1024]; + char padding[SYNC_REGS_SIZE_BYTES]; } s; If KVM_CAP_SYNC_REGS is defined, these fields allow userspace to access @@ -4010,6 +4110,46 @@ Once this is done the KVM_REG_MIPS_VEC_* and KVM_REG_MIPS_MSA_* registers can be accessed, and the Config5.MSAEn bit is accessible via the KVM API and also from the guest. +6.74 KVM_CAP_SYNC_REGS +Architectures: s390, x86 +Target: s390: always enabled, x86: vcpu +Parameters: none +Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register +sets are supported (bitfields defined in arch/x86/include/uapi/asm/kvm.h). + +As described above in the kvm_sync_regs struct info in section 5 (kvm_run): +KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers +without having to call SET/GET_*REGS". This reduces overhead by eliminating +repeated ioctl calls for setting and/or getting register values. This is +particularly important when userspace is making synchronous guest state +modifications, e.g. when emulating and/or intercepting instructions in +userspace. + +For s390 specifics, please refer to the source code. + +For x86: +- the register sets to be copied out to kvm_run are selectable + by userspace (rather that all sets being copied out for every exit). +- vcpu_events are available in addition to regs and sregs. 
+ +For x86, the 'kvm_valid_regs' field of struct kvm_run is overloaded to +function as an input bit-array field set by userspace to indicate the +specific register sets to be copied out on the next exit. + +To indicate when userspace has modified values that should be copied into +the vCPU, the all architecture bitarray field, 'kvm_dirty_regs' must be set. +This is done using the same bitflags as for the 'kvm_valid_regs' field. +If the dirty bit is not set, then the register set values will not be copied +into the vCPU even if they've been modified. + +Unused bitfields in the bitarrays must be set to zero. + +struct kvm_sync_regs { + struct kvm_regs regs; + struct kvm_sregs sregs; + struct kvm_vcpu_events events; +}; + 7. Capabilities that can be enabled on VMs ------------------------------------------ @@ -4218,6 +4358,26 @@ enables QEMU to build error log and branch to guest kernel registered machine check handling routine. Without this capability KVM will branch to guests' 0x200 interrupt vector. +7.13 KVM_CAP_X86_DISABLE_EXITS + +Architectures: x86 +Parameters: args[0] defines which exits are disabled +Returns: 0 on success, -EINVAL when args[0] contains invalid exits + +Valid bits in args[0] are + +#define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0) +#define KVM_X86_DISABLE_EXITS_HLT (1 << 1) + +Enabling this capability on a VM provides userspace with a way to no +longer intercept some instructions for improved latency in some +workloads, and is suggested when vCPUs are associated to dedicated +physical CPUs. More bits can be added in the future; userspace can +just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable +all such vmexits. + +Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits. + 8. Other capabilities. ---------------------- @@ -4330,15 +4490,6 @@ reserved. Both registers and addresses are 64-bits wide. It will be possible to run 64-bit or 32-bit guest code. 
-8.8 KVM_CAP_X86_GUEST_MWAIT - -Architectures: x86 - -This capability indicates that guest using memory monotoring instructions -(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit. As such time -spent while virtual CPU is halted in this way will then be accounted for as -guest running time on the host (as opposed to e.g. HLT). - 8.9 KVM_CAP_ARM_USER_IRQ Architectures: arm, arm64 @@ -4415,3 +4566,33 @@ Parameters: none This capability indicates if the flic device will be able to get/set the AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows to discover this without having to create a flic device. + +8.14 KVM_CAP_S390_PSW + +Architectures: s390 + +This capability indicates that the PSW is exposed via the kvm_run structure. + +8.15 KVM_CAP_S390_GMAP + +Architectures: s390 + +This capability indicates that the user space memory used as guest mapping can +be anywhere in the user memory address space, as long as the memory slots are +aligned and sized to a segment (1MB) boundary. + +8.16 KVM_CAP_S390_COW + +Architectures: s390 + +This capability indicates that the user space memory used as guest mapping can +use copy-on-write semantics as well as dirty pages tracking via read-only page +tables. + +8.17 KVM_CAP_S390_BPB + +Architectures: s390 + +This capability indicates that kvm will implement the interfaces to handle +reset, migration and nested KVM for branch prediction blocking. The stfle +facility 82 should not be provided to the guest without this capability. 
diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt deleted file mode 100644 index 38bca2835278..000000000000 --- a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt +++ /dev/null @@ -1,187 +0,0 @@ -KVM/ARM VGIC Forwarded Physical Interrupts -========================================== - -The KVM/ARM code implements software support for the ARM Generic -Interrupt Controller's (GIC's) hardware support for virtualization by -allowing software to inject virtual interrupts to a VM, which the guest -OS sees as regular interrupts. The code is famously known as the VGIC. - -Some of these virtual interrupts, however, correspond to physical -interrupts from real physical devices. One example could be the -architected timer, which itself supports virtualization, and therefore -lets a guest OS program the hardware device directly to raise an -interrupt at some point in time. When such an interrupt is raised, the -host OS initially handles the interrupt and must somehow signal this -event as a virtual interrupt to the guest. Another example could be a -passthrough device, where the physical interrupts are initially handled -by the host, but the device driver for the device lives in the guest OS -and KVM must therefore somehow inject a virtual interrupt on behalf of -the physical one to the guest OS. - -These virtual interrupts corresponding to a physical interrupt on the -host are called forwarded physical interrupts, but are also sometimes -referred to as 'virtualized physical interrupts' and 'mapped interrupts'. - -Forwarded physical interrupts are handled slightly differently compared -to virtual interrupts generated purely by a software emulated device. - - -The HW bit ----------- -Virtual interrupts are signalled to the guest by programming the List -Registers (LRs) on the GIC before running a VCPU. 
The LR is programmed -with the virtual IRQ number and the state of the interrupt (Pending, -Active, or Pending+Active). When the guest ACKs and EOIs a virtual -interrupt, the LR state moves from Pending to Active, and finally to -inactive. - -The LRs include an extra bit, called the HW bit. When this bit is set, -KVM must also program an additional field in the LR, the physical IRQ -number, to link the virtual with the physical IRQ. - -When the HW bit is set, KVM must EITHER set the Pending OR the Active -bit, never both at the same time. - -Setting the HW bit causes the hardware to deactivate the physical -interrupt on the physical distributor when the guest deactivates the -corresponding virtual interrupt. - - -Forwarded Physical Interrupts Life Cycle ----------------------------------------- - -The state of forwarded physical interrupts is managed in the following way: - - - The physical interrupt is acked by the host, and becomes active on - the physical distributor (*). - - KVM sets the LR.Pending bit, because this is the only way the GICV - interface is going to present it to the guest. - - LR.Pending will stay set as long as the guest has not acked the interrupt. - - LR.Pending transitions to LR.Active on the guest read of the IAR, as - expected. - - On guest EOI, the *physical distributor* active bit gets cleared, - but the LR.Active is left untouched (set). - - KVM clears the LR on VM exits when the physical distributor - active state has been cleared. - -(*): The host handling is slightly more complicated. For some forwarded -interrupts (shared), KVM directly sets the active state on the physical -distributor before entering the guest, because the interrupt is never actually -handled on the host (see details on the timer as an example below). For other -forwarded interrupts (non-shared) the host does not deactivate the interrupt -when the host ISR completes, but leaves the interrupt active until the guest -deactivates it. 
Leaving the interrupt active is allowed, because Linux -configures the physical GIC with EOIMode=1, which causes EOI operations to -perform a priority drop allowing the GIC to receive other interrupts of the -default priority. - - -Forwarded Edge and Level Triggered PPIs and SPIs ------------------------------------------------- -Forwarded physical interrupts injected should always be active on the -physical distributor when injected to a guest. - -Level-triggered interrupts will keep the interrupt line to the GIC -asserted, typically until the guest programs the device to deassert the -line. This means that the interrupt will remain pending on the physical -distributor until the guest has reprogrammed the device. Since we -always run the VM with interrupts enabled on the CPU, a pending -interrupt will exit the guest as soon as we switch into the guest, -preventing the guest from ever making progress as the process repeats -over and over. Therefore, the active state on the physical distributor -must be set when entering the guest, preventing the GIC from forwarding -the pending interrupt to the CPU. As soon as the guest deactivates the -interrupt, the physical line is sampled by the hardware again and the host -takes a new interrupt if and only if the physical line is still asserted. - -Edge-triggered interrupts do not exhibit the same problem with -preventing guest execution that level-triggered interrupts do. One -option is to not use HW bit at all, and inject edge-triggered interrupts -from a physical device as pure virtual interrupts. But that would -potentially slow down handling of the interrupt in the guest, because a -physical interrupt occurring in the middle of the guest ISR would -preempt the guest for the host to handle the interrupt. 
Additionally, -if you configure the system to handle interrupts on a separate physical -core from that running your VCPU, you still have to interrupt the VCPU -to queue the pending state onto the LR, even though the guest won't use -this information until the guest ISR completes. Therefore, the HW -bit should always be set for forwarded edge-triggered interrupts. With -the HW bit set, the virtual interrupt is injected and additional -physical interrupts occurring before the guest deactivates the interrupt -simply mark the state on the physical distributor as Pending+Active. As -soon as the guest deactivates the interrupt, the host takes another -interrupt if and only if there was a physical interrupt between injecting -the forwarded interrupt to the guest and the guest deactivating the -interrupt. - -Consequently, whenever we schedule a VCPU with one or more LRs with the -HW bit set, the interrupt must also be active on the physical -distributor. - - -Forwarded LPIs --------------- -LPIs, introduced in GICv3, are always edge-triggered and do not have an -active state. They become pending when a device signal them, and as -soon as they are acked by the CPU, they are inactive again. - -It therefore doesn't make sense, and is not supported, to set the HW bit -for physical LPIs that are forwarded to a VM as virtual interrupts, -typically virtual SPIs. - -For LPIs, there is no other choice than to preempt the VCPU thread if -necessary, and queue the pending state onto the LR. - - -Putting It Together: The Architected Timer ------------------------------------------- -The architected timer is a device that signals interrupts with level -triggered semantics. The timer hardware is directly accessed by VCPUs -which program the timer to fire at some point in time. Each VCPU on a -system programs the timer to fire at different times, and therefore the -hardware is multiplexed between multiple VCPUs. 
This is implemented by -context-switching the timer state along with each VCPU thread. - -However, this means that a scenario like the following is entirely -possible, and in fact, typical: - -1. KVM runs the VCPU -2. The guest programs the time to fire in T+100 -3. The guest is idle and calls WFI (wait-for-interrupts) -4. The hardware traps to the host -5. KVM stores the timer state to memory and disables the hardware timer -6. KVM schedules a soft timer to fire in T+(100 - time since step 2) -7. KVM puts the VCPU thread to sleep (on a waitqueue) -8. The soft timer fires, waking up the VCPU thread -9. KVM reprograms the timer hardware with the VCPU's values -10. KVM marks the timer interrupt as active on the physical distributor -11. KVM injects a forwarded physical interrupt to the guest -12. KVM runs the VCPU - -Notice that KVM injects a forwarded physical interrupt in step 11 without -the corresponding interrupt having actually fired on the host. That is -exactly why we mark the timer interrupt as active in step 10, because -the active state on the physical distributor is part of the state -belonging to the timer hardware, which is context-switched along with -the VCPU thread. - -If the guest does not idle because it is busy, the flow looks like this -instead: - -1. KVM runs the VCPU -2. The guest programs the time to fire in T+100 -4. At T+100 the timer fires and a physical IRQ causes the VM to exit - (note that this initially only traps to EL2 and does not run the host ISR - until KVM has returned to the host). -5. With interrupts still disabled on the CPU coming back from the guest, KVM - stores the virtual timer state to memory and disables the virtual hw timer. -6. KVM looks at the timer state (in memory) and injects a forwarded physical - interrupt because it concludes the timer has expired. -7. KVM marks the timer interrupt as active on the physical distributor -7. 
KVM enables the timer, enables interrupts, and runs the VCPU - -Notice that again the forwarded physical interrupt is injected to the -guest without having actually been handled on the host. In this case it -is because the physical interrupt is never actually seen by the host because the -timer is disabled upon guest return, and the virtual forwarded interrupt is -injected on the KVM guest entry path. diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt index 3c65feb83010..d4f33eb805dd 100644 --- a/Documentation/virtual/kvm/cpuid.txt +++ b/Documentation/virtual/kvm/cpuid.txt @@ -23,8 +23,8 @@ This function queries the presence of KVM cpuid leafs. function: define KVM_CPUID_FEATURES (0x40000001) -returns : ebx, ecx, edx = 0 - eax = and OR'ed group of (1 << flag), where each flags is: +returns : ebx, ecx + eax = an OR'ed group of (1 << flag), where each flags is: flag || value || meaning @@ -54,7 +54,26 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit || || before enabling paravirtualized || || spinlock support. ------------------------------------------------------------------------------ +KVM_FEATURE_PV_TLB_FLUSH || 9 || guest checks this feature bit + || || before enabling paravirtualized + || || tlb flush. +------------------------------------------------------------------------------ +KVM_FEATURE_ASYNC_PF_VMEXIT || 10 || paravirtualized async PF VM exit + || || can be enabled by setting bit 2 + || || when writing to msr 0x4b564d02 +------------------------------------------------------------------------------ KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side || || per-cpu warps are expected in || || kvmclock. 
 ------------------------------------------------------------------------------
+
+   edx = an OR'ed group of (1 << flag), where each flags is:
+
+
+flag                               || value || meaning
+==================================================================================
+KVM_HINTS_DEDICATED                || 0     || guest checks this feature bit to
+                                   ||       || determine if there is vCPU pinning
+                                   ||       || and there is no vCPU over-commitment,
+                                   ||       || allowing optimizations
+----------------------------------------------------------------------------------
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 1ebecc115dc6..f3f0d57ced8e 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -170,7 +170,8 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 	when asynchronous page faults are enabled on the vcpu 0 when disabled.
 	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
 	cpl == 0. Bit 2 is 1 if asynchronous page faults
-	are delivered to L1 as #PF vmexits.
+	are delivered to L1 as #PF vmexits. Bit 2 can be set only if
+	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
 
 	First 4 byte of 64 byte memory location will be written to by
 	the hypervisor at the time of asynchronous page fault (APF)
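
Illustrative note (not part of the patch): the cpuid.txt and msr.txt hunks above tie MSR_KVM_ASYNC_PF_EN bit 2 to the KVM_FEATURE_ASYNC_PF_VMEXIT flag (bit 10 of eax from leaf 0x40000001) and add KVM_HINTS_DEDICATED as bit 0 of edx. A minimal sketch of how a guest might compose the MSR value under that rule follows. The macro and function names are hypothetical stand-ins, not the kernel's own identifiers, and the 64-byte alignment of the shared area is an assumption read from the "64 byte memory location" wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag numbers as listed in the cpuid.txt hunk (names here are illustrative). */
#define FEATURE_ASYNC_PF_VMEXIT_BIT 10u /* eax of leaf 0x40000001 */
#define HINTS_DEDICATED_BIT          0u /* edx of leaf 0x40000001 */

/* MSR_KVM_ASYNC_PF_EN control bits as described in the msr.txt hunk. */
#define ASYNC_PF_ENABLED     (1ull << 0) /* async page faults enabled */
#define ASYNC_PF_SEND_ALWAYS (1ull << 1) /* may be injected at cpl == 0 */
#define ASYNC_PF_AS_VMEXIT   (1ull << 2) /* delivered to L1 as #PF vmexits */

/* Test one "(1 << flag)" entry in a CPUID register value. */
static bool has_flag(uint32_t reg, unsigned int flag)
{
    return ((reg >> flag) & 1u) != 0;
}

/*
 * Build the value to write to MSR_KVM_ASYNC_PF_EN.
 * features_eax: eax returned by CPUID leaf 0x40000001.
 * apf_area_pa:  physical address of the shared area (assumed 64-byte aligned).
 * Bit 2 is set only when the host advertises KVM_FEATURE_ASYNC_PF_VMEXIT.
 */
static uint64_t async_pf_msr_value(uint32_t features_eax, uint64_t apf_area_pa,
                                   bool send_always)
{
    uint64_t val;

    assert((apf_area_pa & 0x3full) == 0); /* low 6 bits carry the flags */

    val = apf_area_pa | ASYNC_PF_ENABLED;
    if (send_always)
        val |= ASYNC_PF_SEND_ALWAYS;
    if (has_flag(features_eax, FEATURE_ASYNC_PF_VMEXIT_BIT))
        val |= ASYNC_PF_AS_VMEXIT;
    return val;
}
```

A caller would obtain features_eax from the CPUID instruction and write the result with wrmsr; both steps are omitted here because they need guest kernel context. The same has_flag() helper can test HINTS_DEDICATED_BIT against edx.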