Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini: "PPC: - Better machine check handling for HV KVM - Ability to support guests with threads=2, 4 or 8 on POWER9 - Fix for a race that could cause delayed recognition of signals - Fix for a bug where POWER9 guests could sleep with interrupts pending. ARM: - VCPU request overhaul - allow timer and PMU to have their interrupt number selected from userspace - workaround for Cavium erratum 30115 - handling of memory poisonning - the usual crop of fixes and cleanups s390: - initial machine check forwarding - migration support for the CMMA page hinting information - cleanups and fixes x86: - nested VMX bugfixes and improvements - more reliable NMI window detection on AMD - APIC timer optimizations Generic: - VCPU request overhaul + documentation of common code patterns - kvm_stat improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits) Update my email address kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12 kvm: x86: mmu: allow A/D bits to be disabled in an mmu x86: kvm: mmu: make spte mmio mask more explicit x86: kvm: mmu: dead code thanks to access tracking KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code KVM: PPC: Book3S HV: Close race with testing for signals on guest entry KVM: PPC: Book3S HV: Simplify dynamic micro-threading code KVM: x86: remove ignored type attribute KVM: LAPIC: Fix lapic timer injection delay KVM: lapic: reorganize restart_apic_timer KVM: lapic: reorganize start_hv_timer kvm: nVMX: Check memory operand to INVVPID KVM: s390: Inject machine check into the nested guest KVM: s390: Inject machine check into the guest tools/kvm_stat: add new interactive command 'b' tools/kvm_stat: add new command line switch '-i' tools/kvm_stat: fix error on interactive command 'g' KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2017-07-06 18:38:31 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2017-07-06 18:38:31 -0700
commit: c136b84393d4e340e1b53fc7f737dd5827b19ee5 (patch)
tree: 985a1bdfafe7ec5ce2d3c738f601cad3998d8ce9 /Documentation
parent: e0f25a3f2d052e36ff67a9b4db835c3e27e950d8 (diff)
parent: 1372324b328cd5dabaef5e345e37ad48c63df2a9 (diff)
8 files changed, 578 insertions, 7 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f24ee1c99412..aa1d4409fe0a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1862,6 +1862,18 @@
 			for all guests.
 			Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
 
+	kvm-arm.vgic_v3_group0_trap=
+			[KVM,ARM] Trap guest accesses to GICv3 group-0
+			system registers
+
+	kvm-arm.vgic_v3_group1_trap=
+			[KVM,ARM] Trap guest accesses to GICv3 group-1
+			system registers
+
+	kvm-arm.vgic_v3_common_trap=
+			[KVM,ARM] Trap guest accesses to GICv3 common
+			system registers
+
 	kvm-intel.ept=	[KVM,Intel] Disable extended page tables
 			(virtualized MMU) support on capable Intel chips.
 			Default is 1 (enabled)
diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 10f2dddbf449..f5f93dca54b7 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -62,6 +62,7 @@ stable kernels.
 | Cavium         | ThunderX GICv3  | #23154          | CAVIUM_ERRATUM_23154        |
 | Cavium         | ThunderX Core   | #27456          | CAVIUM_ERRATUM_27456        |
 | Cavium         | ThunderX SMMUv2 | #27704          | N/A                         |
+| Cavium         | ThunderX Core   | #30115          | CAVIUM_ERRATUM_30115        |
 |                |                 |                 |                             |
 | Freescale/NXP  | LS2080A/LS1043A | A-008585        | FSL_ERRATUM_A008585         |
 |                |                 |                 |                             |
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 4029943887a3..3a9831b72945 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3255,6 +3255,141 @@ Otherwise, if the MCE is a corrected error, KVM will just
 store it in the corresponding bank (provided this bank is
 not holding a previously reported uncorrected error).
 
+4.107 KVM_S390_GET_CMMA_BITS
+
+Capability: KVM_CAP_S390_CMMA_MIGRATION
+Architectures: s390
+Type: vm ioctl
+Parameters: struct kvm_s390_cmma_log (in, out)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to get the values of the CMMA bits on the s390
+architecture. It is meant to be used in two scenarios:
+- During live migration to save the CMMA values. Live migration needs
+  to be enabled via the KVM_REQ_START_MIGRATION VM property.
+- To non-destructively peek at the CMMA values, with the flag
+  KVM_S390_CMMA_PEEK set.
+
+The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired
+values are written to a buffer whose location is indicated via the "values"
+member in the kvm_s390_cmma_log struct.  The values in the input struct are
+also updated as needed.
+Each CMMA value takes up one byte.
+
+struct kvm_s390_cmma_log {
+	__u64 start_gfn;
+	__u32 count;
+	__u32 flags;
+	union {
+		__u64 remaining;
+		__u64 mask;
+	};
+	__u64 values;
+};
+
+start_gfn is the number of the first guest frame whose CMMA values are
+to be retrieved,
+
+count is the length of the buffer in bytes,
+
+values points to the buffer where the result will be written to.
+
+If count is greater than KVM_S390_SKEYS_MAX, then it is considered to be
+KVM_S390_SKEYS_MAX. KVM_S390_SKEYS_MAX is re-used for consistency with
+other ioctls.
+
+The result is written in the buffer pointed to by the field values, and
+the values of the input parameter are updated as follows.
+
+Depending on the flags, different actions are performed. The only
+supported flag so far is KVM_S390_CMMA_PEEK.
+
+The default behaviour if KVM_S390_CMMA_PEEK is not set is:
+start_gfn will indicate the first page frame whose CMMA bits were dirty.
+It is not necessarily the same as the one passed as input, as clean pages
+are skipped.
+
+count will indicate the number of bytes actually written in the buffer.
+It can (and very often will) be smaller than the input value, since the
+buffer is only filled until 16 bytes of clean values are found (which
+are then not copied in the buffer). Since a CMMA migration block needs
+the base address and the length, for a total of 16 bytes, we will send
+back some clean data if there is some dirty data afterwards, as long as
+the size of the clean data does not exceed the size of the header. This
+allows to minimize the amount of data to be saved or transferred over
+the network at the expense of more roundtrips to userspace. The next
+invocation of the ioctl will skip over all the clean values, saving
+potentially more than just the 16 bytes we found.
+
+If KVM_S390_CMMA_PEEK is set:
+the existing storage attributes are read even when not in migration
+mode, and no other action is performed;
+
+the output start_gfn will be equal to the input start_gfn,
+
+the output count will be equal to the input count, except if the end of
+memory has been reached.
+
+In both cases:
+the field "remaining" will indicate the total number of dirty CMMA values
+still remaining, or 0 if KVM_S390_CMMA_PEEK is set and migration mode is
+not enabled.
+
+mask is unused.
+
+values points to the userspace buffer where the result will be stored.
+
+This ioctl can fail with -ENOMEM if not enough memory can be allocated to
+complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
+KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
+-EFAULT if the userspace address is invalid or if no page table is
+present for the addresses (e.g. when using hugepages).
+
+4.108 KVM_S390_SET_CMMA_BITS
+
+Capability: KVM_CAP_S390_CMMA_MIGRATION
+Architectures: s390
+Type: vm ioctl
+Parameters: struct kvm_s390_cmma_log (in)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to set the values of the CMMA bits on the s390
+architecture. It is meant to be used during live migration to restore
+the CMMA values, but there are no restrictions on its use.
+The ioctl takes parameters via the kvm_s390_cmma_values struct.
+Each CMMA value takes up one byte.
+
+struct kvm_s390_cmma_log {
+	__u64 start_gfn;
+	__u32 count;
+	__u32 flags;
+	union {
+		__u64 remaining;
+		__u64 mask;
+	};
+	__u64 values;
+};
+
+start_gfn indicates the starting guest frame number,
+
+count indicates how many values are to be considered in the buffer,
+
+flags is not used and must be 0.
+
+mask indicates which PGSTE bits are to be considered.
+
+remaining is not used.
+
+values points to the buffer in userspace where to store the values.
+
+This ioctl can fail with -ENOMEM if not enough memory can be allocated to
+complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
+the count field is too large (e.g. more than KVM_S390_CMMA_SIZE_MAX) or
+if the flags field was not 0, with -EFAULT if the userspace address is
+invalid, if invalid pages are written to (e.g. after the end of memory)
+or if no page table is present for the addresses (e.g. when using
+hugepages).
+
 5. The kvm_run structure
 ------------------------
 
@@ -3996,6 +4131,34 @@ Parameters: none
 Allow use of adapter-interruption suppression.
 Returns: 0 on success; -EBUSY if a VCPU has already been created.
 
+7.11 KVM_CAP_PPC_SMT
+
+Architectures: ppc
+Parameters: vsmt_mode, flags
+
+Enabling this capability on a VM provides userspace with a way to set
+the desired virtual SMT mode (i.e. the number of virtual CPUs per
+virtual core).  The virtual SMT mode, vsmt_mode, must be a power of 2
+between 1 and 8.  On POWER8, vsmt_mode must also be no greater than
+the number of threads per subcore for the host.  Currently flags must
+be 0.  A successful call to enable this capability will result in
+vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is
+subsequently queried for the VM.  This capability is only supported by
+HV KVM, and can only be set before any VCPUs have been created.
+The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT
+modes are available.
+
+7.12 KVM_CAP_PPC_FWNMI
+
+Architectures: ppc
+Parameters: none
+
+With this capability a machine check exception in the guest address
+space will cause KVM to exit the guest with NMI exit reason. This
+enables QEMU to build error log and branch to guest kernel registered
+machine check handling routine. Without this capability KVM will
+branch to guests' 0x200 interrupt vector.
+
 8. Other capabilities.
 ----------------------
 
@@ -4157,3 +4320,12 @@ Currently the following bits are defined for the device_irq_level bitmap:
 Future versions of kvm may implement additional events. These will get
 indicated by returning a higher number from KVM_CHECK_EXTENSION and will be
 listed above.
+
+8.10 KVM_CAP_PPC_SMT_POSSIBLE
+
+Architectures: ppc
+
+Querying this capability returns a bitmap indicating the possible
+virtual SMT modes that can be set using KVM_CAP_PPC_SMT.  If bit N
+(counting from the right) is set, then a virtual SMT mode of 2^N is
+available.
diff --git a/Documentation/virtual/kvm/devices/s390_flic.txt b/Documentation/virtual/kvm/devices/s390_flic.txt
index c2518cea8ab4..2f1cbf1301d2 100644
--- a/Documentation/virtual/kvm/devices/s390_flic.txt
+++ b/Documentation/virtual/kvm/devices/s390_flic.txt
@@ -16,6 +16,7 @@ FLIC provides support to
 - register and modify adapter interrupt sources (KVM_DEV_FLIC_ADAPTER_*)
 - modify AIS (adapter-interruption-suppression) mode state (KVM_DEV_FLIC_AISM)
 - inject adapter interrupts on a specified adapter (KVM_DEV_FLIC_AIRQ_INJECT)
+- get/set all AIS mode states (KVM_DEV_FLIC_AISM_ALL)
 
 Groups:
   KVM_DEV_FLIC_ENQUEUE
@@ -136,6 +137,20 @@ struct kvm_s390_ais_req {
     an isc according to the adapter-interruption-suppression mode on condition
     that the AIS capability is enabled.
 
+  KVM_DEV_FLIC_AISM_ALL
+    Gets or sets the adapter-interruption-suppression mode for all ISCs. Takes
+    a kvm_s390_ais_all describing:
+
+struct kvm_s390_ais_all {
+       __u8 simm; /* Single-Interruption-Mode mask */
+       __u8 nimm; /* No-Interruption-Mode mask *
+};
+
+    simm contains Single-Interruption-Mode mask for all ISCs, nimm contains
+    No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds
+    to an ISC (MSB0 bit 0 to ISC 0 and so on). The combination of simm bit and
+    nimm bit presents AIS mode for a ISC.
+
 Note: The KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR device ioctls executed on
 FLIC with an unknown group or attribute gives the error code EINVAL (instead of
 ENXIO, as specified in the API documentation). It is not possible to conclude
diff --git a/Documentation/virtual/kvm/devices/vcpu.txt b/Documentation/virtual/kvm/devices/vcpu.txt
index 02f50686c418..2b5dab16c4f2 100644
--- a/Documentation/virtual/kvm/devices/vcpu.txt
+++ b/Documentation/virtual/kvm/devices/vcpu.txt
@@ -16,7 +16,9 @@ Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
 Returns: -EBUSY: The PMU overflow interrupt is already set
          -ENXIO: The overflow interrupt not set when attempting to get it
          -ENODEV: PMUv3 not supported
-         -EINVAL: Invalid PMU overflow interrupt number supplied
+         -EINVAL: Invalid PMU overflow interrupt number supplied or
+                  trying to set the IRQ number without using an in-kernel
+                  irqchip.
 
 A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
 number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
@@ -25,11 +27,36 @@ all vcpus, while as an SPI it must be a separate number per vcpu.
 
 1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
 Parameters: no additional parameter in kvm_device_attr.addr
-Returns: -ENODEV: PMUv3 not supported
-         -ENXIO: PMUv3 not properly configured as required prior to calling this
-                 attribute
+Returns: -ENODEV: PMUv3 not supported or GIC not initialized
+         -ENXIO: PMUv3 not properly configured or in-kernel irqchip not
+                 configured as required prior to calling this attribute
          -EBUSY: PMUv3 already initialized
 
-Request the initialization of the PMUv3.  This must be done after creating the
-in-kernel irqchip.  Creating a PMU with a userspace irqchip is currently not
-supported.
+Request the initialization of the PMUv3.  If using the PMUv3 with an in-kernel
+virtual GIC implementation, this must be done after initializing the in-kernel
+irqchip.
+
+
+2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
+Architectures: ARM,ARM64
+
+2.1. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_VTIMER
+2.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_PTIMER
+Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
+            pointer to an int
+Returns: -EINVAL: Invalid timer interrupt number
+         -EBUSY:  One or more VCPUs has already run
+
+A value describing the architected timer interrupt number when connected to an
+in-kernel virtual GIC.  These must be a PPI (16 <= intid < 32).  Setting the
+attribute overrides the default values (see below).
+
+KVM_ARM_VCPU_TIMER_IRQ_VTIMER: The EL1 virtual timer intid (default: 27)
+KVM_ARM_VCPU_TIMER_IRQ_PTIMER: The EL1 physical timer intid (default: 30)
+
+Setting the same PPI for different timers will prevent the VCPUs from running.
+Setting the interrupt number on a VCPU configures all VCPUs created at that
+time to use the number provided for a given timer, overwriting any previously
+configured values on other VCPUs.  Userspace should configure the interrupt
+numbers on at least one VCPU after creating all VCPUs and before running any
+VCPUs.
diff --git a/Documentation/virtual/kvm/devices/vm.txt b/Documentation/virtual/kvm/devices/vm.txt
index 575ccb022aac..903fc926860b 100644
--- a/Documentation/virtual/kvm/devices/vm.txt
+++ b/Documentation/virtual/kvm/devices/vm.txt
@@ -222,3 +222,36 @@ Allows user space to disable dea key wrapping, clearing the wrapping key.
 
 Parameters: none
 Returns:    0
+
+5. GROUP: KVM_S390_VM_MIGRATION
+Architectures: s390
+
+5.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o)
+
+Allows userspace to stop migration mode, needed for PGSTE migration.
+Setting this attribute when migration mode is not active will have no
+effects.
+
+Parameters: none
+Returns:    0
+
+5.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o)
+
+Allows userspace to start migration mode, needed for PGSTE migration.
+Setting this attribute when migration mode is already active will have
+no effects.
+
+Parameters: none
+Returns:    -ENOMEM if there is not enough free memory to start migration mode
+	    -EINVAL if the state of the VM is invalid (e.g. no memory defined)
+	    0 in case of success.
+
+5.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o)
+
+Allows userspace to query the status of migration mode.
+
+Parameters: address of a buffer in user space to store the data (u64) to;
+	    the data itself is either 0 if migration mode is disabled or 1
+	    if it is enabled
+Returns:    -EFAULT if the given address is not accessible from kernel space
+	    0 in case of success.
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 481b6a9c25d5..f50d45b1e967 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -179,6 +179,10 @@ Shadow pages contain the following information:
     shadow page; it is also used to go back from a struct kvm_mmu_page
     to a memslot, through the kvm_memslots_for_spte_role macro and
     __gfn_to_memslot.
+  role.ad_disabled:
+    Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
+    bits before Haswell; shadow EPT page tables also cannot use A/D bits
+    if the L1 hypervisor does not enable them.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst
new file mode 100644
index 000000000000..5feb3706a7ae
--- /dev/null
+++ b/Documentation/virtual/kvm/vcpu-requests.rst
@@ -0,0 +1,307 @@
+=================
+KVM VCPU Requests
+=================
+
+Overview
+========
+
+KVM supports an internal API enabling threads to request a VCPU thread to
+perform some activity.  For example, a thread may request a VCPU to flush
+its TLB with a VCPU request.  The API consists of the following functions::
+
+  /* Check if any requests are pending for VCPU @vcpu. */
+  bool kvm_request_pending(struct kvm_vcpu *vcpu);
+
+  /* Check if VCPU @vcpu has request @req pending. */
+  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);
+
+  /* Clear request @req for VCPU @vcpu. */
+  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);
+
+  /*
+   * Check if VCPU @vcpu has request @req pending. When the request is
+   * pending it will be cleared and a memory barrier, which pairs with
+   * another in kvm_make_request(), will be issued.
+   */
+  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
+
+  /*
+   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
+   * with another in kvm_check_request(), prior to setting the request.
+   */
+  void kvm_make_request(int req, struct kvm_vcpu *vcpu);
+
+  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
+  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
+
+Typically a requester wants the VCPU to perform the activity as soon
+as possible after making the request.  This means most requests
+(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
+and kvm_make_all_cpus_request() has the kicking of all VCPUs built
+into it.
+
+VCPU Kicks
+----------
+
+The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
+order to perform some KVM maintenance.  To do so, an IPI is sent, forcing
+a guest mode exit.  However, a VCPU thread may not be in guest mode at the
+time of the kick.  Therefore, depending on the mode and state of the VCPU
+thread, there are two other actions a kick may take.  All three actions
+are listed below:
+
+1) Send an IPI.  This forces a guest mode exit.
+2) Waking a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest
+   mode that wait on waitqueues.  Waking them removes the threads from
+   the waitqueues, allowing the threads to run again.  This behavior
+   may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
+3) Nothing.  When the VCPU is not in guest mode and the VCPU thread is not
+   sleeping, then there is nothing to do.
+
+VCPU Mode
+---------
+
+VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
+guest is running in guest mode or not, as well as some specific
+outside guest mode states.  The architecture may use ``vcpu->mode`` to
+ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
+as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
+even to ensure IPI acknowledgements are waited upon (see "Waiting for
+Acknowledgements").  The following modes are defined:
+
+OUTSIDE_GUEST_MODE
+
+  The VCPU thread is outside guest mode.
+
+IN_GUEST_MODE
+
+  The VCPU thread is in guest mode.
+
+EXITING_GUEST_MODE
+
+  The VCPU thread is transitioning from IN_GUEST_MODE to
+  OUTSIDE_GUEST_MODE.
+
+READING_SHADOW_PAGE_TABLES
+
+  The VCPU thread is outside guest mode, but it wants the sender of
+  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
+  thread is done reading the page tables.
+
+VCPU Request Internals
+======================
+
+VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
+This means general bitops, like those documented in [atomic-ops]_ could
+also be used, e.g. ::
+
+  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
+
+However, VCPU request users should refrain from doing so, as it would
+break the abstraction.  The first 8 bits are reserved for architecture
+independent requests, all additional bits are available for architecture
+dependent requests.
+
+Architecture Independent Requests
+---------------------------------
+
+KVM_REQ_TLB_FLUSH
+
+  KVM's common MMU notifier may need to flush all of a guest's TLB
+  entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that
+  choose to use the common kvm_flush_remote_tlbs() implementation will
+  need to handle this VCPU request.
+
+KVM_REQ_MMU_RELOAD
+
+  When shadow page tables are used and memory slots are removed it's
+  necessary to inform each VCPU to completely refresh the tables.  This
+  request is used for that.
+
+KVM_REQ_PENDING_TIMER
+
+  This request may be made from a timer handler run on the host on behalf
+  of a VCPU.  It informs the VCPU thread to inject a timer interrupt.
+
+KVM_REQ_UNHALT
+
+  This request may be made from the KVM common function kvm_vcpu_block(),
+  which is used to emulate an instruction that causes a CPU to halt until
+  one of an architectural specific set of events and/or interrupts is
+  received (determined by checking kvm_arch_vcpu_runnable()).  When that
+  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
+  in contrast to when kvm_vcpu_block() returns due to any other reason,
+  such as a pending signal, which does not indicate the VCPU's halt
+  emulation should stop, and therefore does not make the request.
+
+KVM_REQUEST_MASK
+----------------
+
+VCPU requests should be masked by KVM_REQUEST_MASK before using them with
+bitops.  This is because only the lower 8 bits are used to represent the
+request's number.  The upper bits are used as flags.  Currently only two
+flags are defined.
+
+VCPU Request Flags
+------------------
+
+KVM_REQUEST_NO_WAKEUP
+
+  This flag is applied to requests that only need immediate attention
+  from VCPUs running in guest mode.  That is, sleeping VCPUs do not need
+  to be awaken for these requests.  Sleeping VCPUs will handle the
+  requests when they are awaken later for some other reason.
+
+KVM_REQUEST_WAIT
+
+  When requests with this flag are made with kvm_make_all_cpus_request(),
+  then the caller will wait for each VCPU to acknowledge its IPI before
+  proceeding.  This flag only applies to VCPUs that would receive IPIs.
+  If, for example, the VCPU is sleeping, so no IPI is necessary, then
+  the requesting thread does not wait.  This means that this flag may be
+  safely combined with KVM_REQUEST_NO_WAKEUP.  See "Waiting for
+  Acknowledgements" for more information about requests with
+  KVM_REQUEST_WAIT.
+
+VCPU Requests with Associated State
+===================================
+
+Requesters that want the receiving VCPU to handle new state need to ensure
+the newly written state is observable to the receiving VCPU thread's CPU
+by the time it observes the request.  This means a write memory barrier
+must be inserted after writing the new state and before setting the VCPU
+request bit.  Additionally, on the receiving VCPU thread's side, a
+corresponding read barrier must be inserted after reading the request bit
+and before proceeding to read the new state associated with it.  See
+scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
+[memory-barriers]_.
+
+The pair of functions, kvm_check_request() and kvm_make_request(), provide
+the memory barriers, allowing this requirement to be handled internally by
+the API.
+
+Ensuring Requests Are Seen
+==========================
+
+When making requests to VCPUs, we want to avoid the receiving VCPU
+executing in guest mode for an arbitrary long time without handling the
+request.  We can be sure this won't happen as long as we ensure the VCPU
+thread checks kvm_request_pending() before entering guest mode and that a
+kick will send an IPI to force an exit from guest mode when necessary.
+Extra care must be taken to cover the period after the VCPU thread's last
+kvm_request_pending() check and before it has entered guest mode, as kick
+IPIs will only trigger guest mode exits for VCPU threads that are in guest
+mode or at least have already disabled interrupts in order to prepare to
+enter guest mode.  This means that an optimized implementation (see "IPI
+Reduction") must be certain when it's safe to not send the IPI.  One
+solution, which all architectures except s390 apply, is to:
+
+- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
+  the last kvm_request_pending() check;
+- enable interrupts atomically when entering the guest.
+
+This solution also requires memory barriers to be placed carefully in both
+the requesting thread and the receiving VCPU.  With the memory barriers we
+can exclude the possibility of a VCPU thread observing
+!kvm_request_pending() on its last check and then not receiving an IPI for
+the next request made of it, even if the request is made immediately after
+the check.  This is done by way of the Dekker memory barrier pattern
+(scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables,
+this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting
+them into the pattern gives::
+
+  CPU1                                    CPU2
+  =================                       =================
+  local_irq_disable();
+  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
+  smp_mb();                               smp_mb();
+  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
+                                              IN_GUEST_MODE) {
+      ...abort guest entry...                 ...send IPI...
+  }                                       }
+
+As stated above, the IPI is only useful for VCPU threads in guest mode or
+that have already disabled interrupts.  This is why this specific case of
+the Dekker pattern has been extended to disable interrupts before setting
+``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to
+pedantically implement the memory barrier pattern, guaranteeing the
+compiler doesn't interfere with ``vcpu->mode``'s carefully planned
+accesses.
+
+IPI Reduction
+-------------
+
+As only one IPI is needed to get a VCPU to check for any/all requests,
+then they may be coalesced.  This is easily done by having the first IPI
+sending kick also change the VCPU mode to something !IN_GUEST_MODE.  The
+transitional state, EXITING_GUEST_MODE, is used for this purpose.
+
+Waiting for Acknowledgements
+----------------------------
+
+Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
+be sent, and the acknowledgements to be waited upon, even when the target
+VCPU threads are in modes other than IN_GUEST_MODE.  For example, one case
+is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
+is set after disabling interrupts.  To support these cases, the
+KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
+checking that the VCPU is IN_GUEST_MODE to checking that it is not
+OUTSIDE_GUEST_MODE.
+
+Request-less VCPU Kicks
+-----------------------
+
+As the determination of whether or not to send an IPI depends on the
+two-variable Dekker memory barrier pattern, then it's clear that
+request-less VCPU kicks are almost never correct.  Without the assurance
+that a non-IPI generating kick will still result in an action by the
+receiving VCPU, as the final kvm_request_pending() check does for
+request-accompanying kicks, then the kick may not do anything useful at
+all.  If, for instance, a request-less kick was made to a VCPU that was
+just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
+the VCPU thread may continue its entry without actually having done
+whatever it was the kick was meant to initiate.
+
+One exception is x86's posted interrupt mechanism.  In this case, however,
+even the request-less VCPU kick is coupled with the same
+local_irq_disable() + smp_mb() pattern described above; the ON bit
+(Outstanding Notification) in the posted interrupt descriptor takes the
+role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is
+set before reading ``vcpu->mode``; dually, in the VCPU thread,
+vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
+IN_GUEST_MODE.
+
+Additional Considerations
+=========================
+
+Sleeping VCPUs
+--------------
+
+VCPU threads may need to consider requests before and/or after calling
+functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they
+do or not, and, if they do, which requests need consideration, is
+architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
+to check if it should awaken.  One reason to do so is to provide
+architectures a function where requests may be checked if necessary.
+
+Clearing Requests
+-----------------
+
+Generally it only makes sense for the receiving VCPU thread to clear a
+request.  However, in some circumstances, such as when the requesting
+thread and the receiving VCPU thread are executed serially, such as when
+they are the same thread, or when they are using some form of concurrency
+control to temporarily execute synchronously, then it's possible to know
+that the request may be cleared immediately, rather than waiting for the
+receiving VCPU thread to handle the request in VCPU RUN.  The only current
+examples of this are kvm_vcpu_block() calls made by VCPUs to block
+themselves.  A possible side-effect of that call is to make the
+KVM_REQ_UNHALT request, which may then be cleared immediately when the
+VCPU returns from the call.
+
+References
+==========
+
+.. [atomic-ops] Documentation/core-api/atomic_ops.rst
+.. [memory-barriers] Documentation/memory-barriers.txt
+.. [lwn-mb] https://lwn.net/Articles/573436/
author	Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 18:38:31 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 18:38:31 -0700
commit	c136b84393d4e340e1b53fc7f737dd5827b19ee5 (patch)
tree	985a1bdfafe7ec5ce2d3c738f601cad3998d8ce9 /Documentation
parent	e0f25a3f2d052e36ff67a9b4db835c3e27e950d8 (diff)
parent	1372324b328cd5dabaef5e345e37ad48c63df2a9 (diff)