Diffstat (limited to 'Documentation/virt')
-rw-r--r-- Documentation/virt/coco/sev-guest.rst 30
-rw-r--r-- Documentation/virt/hyperv/clocks.rst 21
-rw-r--r-- Documentation/virt/hyperv/coco.rst 260
-rw-r--r-- Documentation/virt/hyperv/hibernation.rst 336
-rw-r--r-- Documentation/virt/hyperv/index.rst 2
-rw-r--r-- Documentation/virt/hyperv/overview.rst 22
-rw-r--r-- Documentation/virt/hyperv/vmbus.rst 143
-rw-r--r-- Documentation/virt/kvm/api.rst 437
-rw-r--r-- Documentation/virt/kvm/arm/fw-pseudo-registers.rst 138
-rw-r--r-- Documentation/virt/kvm/arm/hypercalls.rst 278
-rw-r--r-- Documentation/virt/kvm/arm/index.rst 1
-rw-r--r-- Documentation/virt/kvm/arm/ptp_kvm.rst 38
-rw-r--r-- Documentation/virt/kvm/devices/arm-vgic.rst 2
-rw-r--r-- Documentation/virt/kvm/devices/s390_flic.rst 4
-rw-r--r-- Documentation/virt/kvm/devices/vcpu.rst 14
-rw-r--r-- Documentation/virt/kvm/halt-polling.rst 12
-rw-r--r-- Documentation/virt/kvm/index.rst 1
-rw-r--r-- Documentation/virt/kvm/locking.rst 112
-rw-r--r-- Documentation/virt/kvm/loongarch/hypercalls.rst 89
-rw-r--r-- Documentation/virt/kvm/loongarch/index.rst 10
-rw-r--r-- Documentation/virt/kvm/s390/s390-diag.rst 35
-rw-r--r-- Documentation/virt/kvm/x86/amd-memory-encryption.rst 169
-rw-r--r-- Documentation/virt/kvm/x86/errata.rst 30
-rw-r--r-- Documentation/virt/uml/user_mode_linux_howto_v2.rst 39
24 files changed, 1804 insertions, 419 deletions
diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
index e1eaf6a830ce..93debceb6eb0 100644
--- a/Documentation/virt/coco/sev-guest.rst
+++ b/Documentation/virt/coco/sev-guest.rst
@@ -176,6 +176,25 @@ to SNP_CONFIG command defined in the SEV-SNP spec. The current values of
the firmware parameters affected by this command can be queried via
SNP_PLATFORM_STATUS.
+2.7 SNP_VLEK_LOAD
+-----------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_user_data_snp_vlek_load
+:Returns (out): 0 on success, -negative on error
+
+When requesting an attestation report a guest is able to specify whether
+it wants SNP firmware to sign the report using either a Versioned Chip
+Endorsement Key (VCEK), which is derived from chip-unique secrets, or a
+Versioned Loaded Endorsement Key (VLEK) which is obtained from an AMD
+Key Derivation Service (KDS) and derived from seeds allocated to
+enrolled cloud service providers.
+
+In the case of VLEK keys, the SNP_VLEK_LOAD SNP command is used to load
+them into the system after obtaining them from the KDS, and corresponds
+closely to the SNP_VLEK_LOAD firmware command specified in the SEV-SNP
+spec.
+
3. SEV-SNP CPUID Enforcement
============================
@@ -204,6 +223,17 @@ has taken care to make use of the SEV-SNP CPUID throughout all stages of boot.
Otherwise, guest owner attestation provides no assurance that the kernel wasn't
fed incorrect values at some point during boot.
+4. SEV Guest Driver Communication Key
+=====================================
+
+Communication between an SEV guest and the SEV firmware in the AMD Secure
+Processor (ASP, aka PSP) is protected by a VM Platform Communication Key
+(VMPCK). By default, the sev-guest driver uses the VMPCK associated with the
+VM Privilege Level (VMPL) at which the guest is running. Should this key be
+wiped by the sev-guest driver (see the driver for reasons why a VMPCK can be
+wiped), a different key can be used by reloading the sev-guest driver and
+specifying the desired key using the vmpck_id module parameter.
+
Reference
---------
diff --git a/Documentation/virt/hyperv/clocks.rst b/Documentation/virt/hyperv/clocks.rst
index a56f4837d443..176043265803 100644
--- a/Documentation/virt/hyperv/clocks.rst
+++ b/Documentation/virt/hyperv/clocks.rst
@@ -62,12 +62,21 @@ shared page with scale and offset values into user space. User
space code performs the same algorithm of reading the TSC and
applying the scale and offset to get the constant 10 MHz clock.
-Linux clockevents are based on Hyper-V synthetic timer 0. While
-Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
-timer 0. Interrupts from stimer0 are recorded on the "HVS" line in
-/proc/interrupts. Clockevents based on the virtualized PIT and
-local APIC timer also work, but the Hyper-V synthetic timer is
-preferred.
+Linux clockevents are based on Hyper-V synthetic timer 0 (stimer0).
+While Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
+timer 0. In older versions of Hyper-V, an interrupt from stimer0
+results in a VMBus control message that is demultiplexed by
+vmbus_isr() as described in the Documentation/virt/hyperv/vmbus.rst
+documentation. In newer versions of Hyper-V, stimer0 interrupts can
+be mapped to an architectural interrupt, which is referred to as
+"Direct Mode". Linux prefers to use Direct Mode when available. Since
+x86/x64 doesn't support per-CPU interrupts, Direct Mode statically
+allocates an x86 interrupt vector (HYPERV_STIMER0_VECTOR) across all CPUs
+and explicitly codes it to call the stimer0 interrupt handler. Hence
+interrupts from stimer0 are recorded on the "HVS" line in /proc/interrupts
+rather than being associated with a Linux IRQ. Clockevents based on the
+virtualized PIT and local APIC timer also work, but Hyper-V stimer0
+is preferred.
The driver for the Hyper-V synthetic system clock and timers is
drivers/clocksource/hyperv_timer.c.
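
Reviewer note: the context lines of this hunk mention reading the TSC and applying the scale and offset to obtain the constant 10 MHz clock. A minimal sketch of that arithmetic follows. The exact formula (take the upper 64 bits of a 64x64-bit multiply, then add the offset) is an assumption based on the commonly described Hyper-V reference TSC page, and the function name is hypothetical. It is a model, not the kernel code.

```python
MASK_64 = (1 << 64) - 1  # model unsigned 64-bit wraparound


def reference_time_100ns(tsc: int, scale: int, offset: int) -> int:
    """Convert a raw TSC value to 100 ns units (a 10 MHz clock).

    Assumed formula: ((tsc * scale) >> 64) + offset, in 64-bit arithmetic.
    """
    # 64x64 -> 128-bit multiply, keeping only the upper 64 bits.
    scaled = (tsc * scale) >> 64
    # Add the offset and wrap to 64 bits, as unsigned arithmetic would.
    return (scaled + offset) & MASK_64
```

With a scale of `1 << 63` (a ratio of one half), 1000 TSC ticks plus an offset of 7 gives 507 clock units.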
diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst
new file mode 100644
index 000000000000..c15d6fe34b4e
--- /dev/null
+++ b/Documentation/virt/hyperv/coco.rst
@@ -0,0 +1,260 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Confidential Computing VMs
+==========================
+Hyper-V can create and run Linux guests that are Confidential Computing
+(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
+the confidentiality and integrity of data in the VM's memory, even in the
+face of a hypervisor/VMM that has been compromised and may behave maliciously.
+CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
+objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
+that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
+"isolation VMs".
+
+A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
+following:
+
+* Physical hardware with a processor that supports CoCo VMs
+
+* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
+
+* The VM runs a version of Linux that supports being a CoCo VM
+
+The physical hardware requirements are as follows:
+
+* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
+ SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
+ VM on Hyper-V.
+
+* Intel processor with TDX
+
+To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
+when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
+or vice versa, after it is created.
+
+Operational Modes
+-----------------
+Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
+created and cannot be changed during the life of the VM.
+
+* Fully-enlightened mode. In this mode, the guest operating system is
+ enlightened to understand and manage all aspects of running as a CoCo VM.
+
+* Paravisor mode. In this mode, a paravisor layer between the guest and the
+ host provides some operations needed to run as a CoCo VM. The guest operating
+ system can have fewer CoCo enlightenments than is required in the
+ fully-enlightened case.
+
+Conceptually, fully-enlightened mode and paravisor mode may be treated as
+points on a spectrum spanning the degree of guest enlightenment needed to run
+as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
+implementation of paravisor mode is the other end of the spectrum, where all
+aspects of running as a CoCo VM are handled by the paravisor, and a normal
+guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
+can run successfully. However, the Hyper-V implementation of paravisor mode
+does not go this far, and is somewhere in the middle of the spectrum. Some
+aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
+must be enlightened for other aspects. Unfortunately, there is no
+standardized enumeration of features/functions that might be provided in the
+paravisor, and there is no standardized mechanism for a guest OS to query the
+paravisor for the features/functions it provides. The understanding of what
+the paravisor provides is hard-coded in the guest OS.
+
+Paravisor mode has similarities to the `Coconut project`_, which aims to offer
+a limited paravisor that provides services to the guest, such as a virtual TPM.
+However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
+than is currently envisioned for Coconut, and so is further toward the "no
+guest enlightenments required" end of the spectrum.
+
+.. _Coconut project: https://github.com/coconut-svsm/svsm
+
+In the CoCo VM threat model, the paravisor is in the guest security domain
+and must be trusted by the guest OS. By implication, the hypervisor/VMM must
+protect itself against a potentially malicious paravisor just like it
+protects against a potentially malicious guest.
+
+The hardware architectural approach to fully-enlightened vs. paravisor mode
+varies depending on the underlying processor.
+
+* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
+ VMPL 0 and has full control of the guest context. In paravisor mode, the
+ guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
+ running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
+ Certain operations require the guest to invoke the paravisor. Furthermore, in
+ paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
+ as defined by the SEV-SNP architecture. This mode simplifies guest management
+ of memory encryption when a paravisor is used.
+
+* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an
+ L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
+ L1 VM, and the guest OS runs in a nested L2 VM.
+
+Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
+MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
+whether a paravisor is being used. It is straightforward to build a single
+kernel image that can boot and run properly on either architecture, and in
+either mode.
+
+Paravisor Effects
+-----------------
+Running in paravisor mode affects the following areas of generic Linux kernel
+CoCo VM functionality:
+
+* Initial guest memory setup. When a new VM is created in paravisor mode, the
+ paravisor runs first and sets up the guest physical memory as encrypted. The
+ guest Linux does normal memory initialization, except for explicitly marking
+ appropriate ranges as decrypted (shared). In paravisor mode, Linux does not
+ perform the early boot memory setup steps that are particularly tricky with
+ AMD SEV-SNP in fully-enlightened mode.
+
+* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
+ CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
+ respectively, and not the guest Linux. Consequently, these exception handlers
+ do not run in the guest Linux and are not a required enlightenment for a
+ Linux guest in paravisor mode.
+
+* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
+ guest indicating that the VM is operating with the respective hardware
+ support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
+ the paravisor filters out these flags and the guest Linux does not see them.
+ Throughout the Linux kernel, explicitly testing these flags has mostly been
+ eliminated in favor of the cc_platform_has() function, with the goal of
+ abstracting the differences between SEV-SNP and TDX. But the
+ cc_platform_has() abstraction also allows the Hyper-V paravisor configuration
+ to selectively enable aspects of CoCo VM functionality even when the CPUID
+ flags are not set. The exception is early boot memory setup on SEV-SNP, which
+ tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor
+ mode VM achieves the desired effect of not running SEV-SNP specific early
+ boot memory setup.
+
+* Device emulation. In paravisor mode, the Hyper-V paravisor provides
+ emulation of devices such as the IO-APIC and TPM. Because the emulation
+ happens in the paravisor in the guest context (instead of the hypervisor/VMM
+ context), MMIO accesses to these devices must be encrypted references instead
+ of the decrypted references that would be used in a fully-enlightened CoCo
+ VM. The __ioremap_caller() function has been enhanced to make a callback to
+ check whether a particular address range should be treated as encrypted
+ (private). See the "is_private_mmio" callback.
+
+* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
+ memory between encrypted and decrypted requires coordinating with the
+ hypervisor/VMM. This is done via callbacks invoked from
+ __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
+ TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
+ specific set of callbacks is used. These callbacks invoke the paravisor so
+ that the paravisor can coordinate the transitions and inform the hypervisor
+ as necessary. See hv_vtom_init() where these callbacks are set up.
+
+* Interrupt injection. In fully-enlightened mode, a malicious hypervisor
+ could inject interrupts into the guest OS at times that violate x86/x64
+ architectural rules. For full protection, the guest OS should include
+ enlightenments that use the interrupt injection management features provided
+ by CoCo-capable processors. In paravisor mode, the paravisor mediates
+ interrupt injection into the guest OS, and ensures that the guest OS only
+ sees interrupts that are "legal". The paravisor uses the interrupt injection
+ management features provided by the CoCo-capable physical processor, thereby
+ masking these complexities from the guest OS.
+
+Hyper-V Hypercalls
+------------------
+When in fully-enlightened mode, hypercalls made by the Linux guest are routed
+directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode,
+normal hypercalls trap to the paravisor first, which may in turn invoke the
+hypervisor. The paravisor is idiosyncratic in this regard, however, and a few
+hypercalls made by the Linux guest must always be routed directly to the
+hypervisor. These hypercall sites test for a paravisor being present, and use
+a special invocation sequence. See hv_post_message(), for example.
+
+Guest communication with Hyper-V
+--------------------------------
+Separate from the generic Linux kernel handling of memory encryption in Linux
+CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
+shared between the Linux guest and the host. This shared memory must be
+marked decrypted to enable communication. Furthermore, since the threat model
+includes a compromised and potentially malicious host, the guest must guard
+against leaking any unintended data to the host through this shared memory.
+
+These Hyper-V and VMBus memory pages are marked as decrypted:
+
+* VMBus monitor pages
+
+* Synthetic interrupt controller (synic) related pages (unless supplied by
+ the paravisor)
+
+* Per-cpu hypercall input and output pages (unless running with a paravisor)
+
+* VMBus ring buffers. The direct mapping is marked decrypted in
+ __vmbus_establish_gpadl(). The secondary mapping created in
+ hv_ringbuffer_init() must also include the "decrypted" attribute.
+
+When the guest writes data to memory that is shared with the host, it must
+ensure that only the intended data is written. Padding or unused fields must
+be initialized to zeros before copying into the shared memory so that random
+kernel data is not inadvertently given to the host.
+
+Similarly, when the guest reads memory that is shared with the host, it must
+validate the data before acting on it so that a malicious host cannot induce
+the guest to expose unintended data. Doing such validation can be tricky
+because the host can modify the shared memory areas even while or after
+validation is performed. For messages passed from the host to the guest in a
+VMBus ring buffer, the length of the message is validated, and the message is
+copied into a temporary (encrypted) buffer for further validation and
+processing. The copying adds a small amount of overhead, but is the only way
+to protect against a malicious host. See hv_pkt_iter_first().
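
Reviewer note: the copy-then-validate pattern described above can be modelled in a few lines. This is a sketch under stated assumptions, not the logic of hv_pkt_iter_first(). The header layout, the `MAX_PACKET` bound, and the function name are hypothetical, and a `bytearray` stands in for memory that the host can rewrite at any time.

```python
import struct

# Hypothetical wire format for this sketch: 16-bit type, 16-bit total length.
HEADER = struct.Struct("<HH")
MAX_PACKET = 4096  # hypothetical upper bound on a packet


def read_packet(shared: bytearray, offset: int = 0) -> bytes:
    """Return a private, validated copy of one packet from shared memory."""
    # Fetch the header once; later decisions use this snapshot only.
    header = bytes(shared[offset:offset + HEADER.size])
    if len(header) < HEADER.size:
        raise ValueError("truncated header")
    _kind, length = HEADER.unpack(header)

    # Validate the length before using it to size the copy.
    if not HEADER.size <= length <= MAX_PACKET:
        raise ValueError("implausible packet length")
    if offset + length > len(shared):
        raise ValueError("packet runs past the shared buffer")

    # One copy into private memory; the host cannot modify this buffer.
    private = bytes(shared[offset:offset + length])

    # The host may have rewritten the header between the two reads, so
    # check that the private copy still agrees with the validated length.
    if HEADER.unpack_from(private)[1] != length:
        raise ValueError("header changed during copy")
    return private
```

All further parsing would then operate on the returned private copy and never on `shared`, so a later change by the host cannot invalidate checks that have already passed.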
+
+Many drivers for VMBus devices have been "hardened" by adding code to fully
+validate messages received over VMBus, instead of assuming that Hyper-V is
+acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
+vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
+CoCo VM have not been hardened, and they are not allowed to load in a CoCo
+VM. See vmbus_is_valid_offer() where such devices are excluded.
+
+Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
+storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
+Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
+memory is done implicitly. netvsc has two modes for data transfers. The first
+mode goes through send and receive buffer space that is explicitly allocated
+by the netvsc driver, and is used for most smaller packets. These send and
+receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
+the netvsc driver explicitly copies packets to/from these buffers, the
+equivalent of bounce buffering between encrypted and decrypted memory is
+already part of the data path. The second mode uses the normal Linux kernel
+DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
+storvsc.
+
+Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
+Linux PCI device drivers access PCI config space using standard APIs provided
+by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
+space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
+encryption prevents Hyper-V from reading the guest instruction stream to
+emulate the access. So in a CoCo VM, these functions must make a hypercall
+with arguments explicitly describing the access. See
+_hv_pcifront_read_config() and _hv_pcifront_write_config() and the
+"use_calls" flag indicating to use hypercalls.
+
+load_unaligned_zeropad()
+------------------------
+When transitioning memory between encrypted and decrypted, the caller of
+set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
+the memory isn't in use and isn't referenced while the transition is in
+progress. The transition has multiple steps, and includes interaction with
+the Hyper-V host. The memory is in an inconsistent state until all steps are
+complete. A reference while the state is inconsistent could result in an
+exception that can't be cleanly fixed up.
+
+However, the kernel load_unaligned_zeropad() mechanism may make stray
+references that can't be prevented by the caller of set_memory_encrypted() or
+set_memory_decrypted(), so there's specific code in the #VC or #VE exception
+handler to fix up this case. But a CoCo VM running on Hyper-V may be
+configured to run with a paravisor, with the #VC or #VE exception routed to
+the paravisor. There's no architectural way to forward the exceptions back to
+the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
+in the #VC/#VE handlers doesn't run.
+
+To avoid this problem, the Hyper-V specific functions for notifying the
+hypervisor of the transition mark pages as "not present" while a transition
+is in progress. If load_unaligned_zeropad() causes a stray reference, a
+normal page fault is generated instead of #VC or #VE, and the
+page-fault-based handlers for load_unaligned_zeropad() fix up the reference.
+When the encrypted/decrypted transition is complete, the pages are marked as
+"present" again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
diff --git a/Documentation/virt/hyperv/hibernation.rst b/Documentation/virt/hyperv/hibernation.rst
new file mode 100644
index 000000000000..4ff27f4a317a
--- /dev/null
+++ b/Documentation/virt/hyperv/hibernation.rst
@@ -0,0 +1,336 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Hibernating Guest VMs
+=====================
+
+Background
+----------
+Linux supports the ability to hibernate itself in order to save power.
+Hibernation is sometimes called suspend-to-disk, as it writes a memory
+image to disk and puts the hardware into the lowest possible power
+state. Upon resume from hibernation, the hardware is restarted and the
+memory image is restored from disk so that it can resume execution
+where it left off. See the "Hibernation" section of
+Documentation/admin-guide/pm/sleep-states.rst.
+
+Hibernation is usually done on devices with a single user, such as a
+personal laptop. For example, the laptop goes into hibernation when
+the cover is closed, and resumes when the cover is opened again.
+Hibernation and resume happen on the same hardware, and Linux kernel
+code orchestrating the hibernation steps assumes that the hardware
+configuration is not changed while in the hibernated state.
+
+Hibernation can be initiated within Linux by writing "disk" to
+/sys/power/state or by invoking the reboot system call with the
+appropriate arguments. This functionality may be wrapped by user space
+commands such as "systemctl hibernate" that are run directly from a
+command line or in response to events such as the laptop lid closing.
+
+Considerations for Guest VM Hibernation
+---------------------------------------
+Linux guests on Hyper-V can also be hibernated, in which case the
+hardware is the virtual hardware provided by Hyper-V to the guest VM.
+Only the targeted guest VM is hibernated, while other guest VMs and
+the underlying Hyper-V host continue to run normally. While the
+underlying Windows Hyper-V and physical hardware on which it is
+running might also be hibernated using hibernation functionality in
+the Windows host, host hibernation and its impact on guest VMs is not
+in scope for this documentation.
+
+Resuming a hibernated guest VM can be more challenging than with
+physical hardware because VMs make it very easy to change the hardware
+configuration between the hibernation and resume. Even when the resume
+is done on the same VM that hibernated, the memory size might be
+changed, or virtual NICs or SCSI controllers might be added or
+removed. Virtual PCI devices assigned to the VM might be added or
+removed. Most such changes cause the resume steps to fail, though
+adding a new virtual NIC, SCSI controller, or vPCI device should work.
+
+Additional complexity can ensue because the disks of the hibernated VM
+can be moved to another newly created VM that otherwise has the same
+virtual hardware configuration. While it is desirable for resume from
+hibernation to succeed after such a move, there are challenges. See
+details on this scenario and its limitations in the "Resuming on a
+Different VM" section below.
+
+Hyper-V also provides ways to move a VM from one Hyper-V host to
+another. Hyper-V tries to ensure processor model and Hyper-V version
+compatibility using VM Configuration Versions, and prevents moves to
+a host that isn't compatible. Linux adapts to host and processor
+differences by detecting them at boot time, but such detection is not
+done when resuming execution in the hibernation image. If a VM is
+hibernated on one host, then resumed on a host with a different processor
+model or Hyper-V version, settings recorded in the hibernation image
+may not match the new host. Because Linux does not detect such
+mismatches when resuming the hibernation image, undefined behavior
+and failures could result.
+
+
+Enabling Guest VM Hibernation
+-----------------------------
+Hibernation of a Hyper-V guest VM is disabled by default because
+hibernation is incompatible with memory hot-add, as provided by the
+Hyper-V balloon driver. If hot-add is used and the VM hibernates, it
+hibernates with more memory than it started with. But when the VM
+resumes from hibernation, Hyper-V gives the VM only the originally
+assigned memory, and the memory size mismatch causes resume to fail.
+
+To enable a Hyper-V VM for hibernation, the Hyper-V administrator must
+enable the ACPI virtual S4 sleep state in the ACPI configuration that
+Hyper-V provides to the guest VM. Such enablement is accomplished by
+modifying a WMI property of the VM, the steps for which are outside
+the scope of this documentation but are available on the web.
+Enablement is treated as the indicator that the administrator
+prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V
+balloon driver in Linux disables hot-add. Enablement is indicated if
+the contents of /sys/power/disk include "platform" as an option. The
+enablement is also visible in /sys/bus/vmbus/hibernation. See function
+hv_is_hibernation_supported().
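
Reviewer note: the /sys/power/disk check described above can be sketched from user space as follows. The function names are hypothetical. The sketch assumes the usual sysfs convention that the file holds a space-separated list of options with the currently selected one in square brackets.

```python
from pathlib import Path


def platform_option_listed(disk_contents: str) -> bool:
    """Return True if "platform" appears among the listed options."""
    # Strip the brackets that mark the currently selected option.
    options = [token.strip("[]") for token in disk_contents.split()]
    return "platform" in options


def guest_hibernation_enabled(path: str = "/sys/power/disk") -> bool:
    """Read the sysfs file; treat a missing or unreadable file as "no"."""
    try:
        return platform_option_listed(Path(path).read_text())
    except OSError:
        return False
```

Calling `guest_hibernation_enabled()` inside the guest gives a quick indication of whether the administrator has enabled the virtual S4 state. It does not replace the check in hv_is_hibernation_supported().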
+
+Linux supports ACPI sleep states on x86, but not on arm64. So Linux
+guest VM hibernation is not available on Hyper-V for arm64.
+
+Initiating Guest VM Hibernation
+-------------------------------
+Guest VMs can self-initiate hibernation using the standard Linux
+methods of writing "disk" to /sys/power/state or the reboot system
+call. As an additional layer, Linux guests on Hyper-V support the
+"Shutdown" integration service, via which a Hyper-V administrator can
+tell a Linux VM to hibernate using a command outside the VM. The
+command generates a request to the Hyper-V shutdown driver in Linux,
+which sends the uevent "EVENT=hibernate". See kernel functions
+shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule
+must be provided in the VM that handles this event and initiates
+hibernation.
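
Reviewer note: the text says a udev rule must handle the "EVENT=hibernate" uevent but does not show one. Below is a sketch of the decision a helper program started by such a rule might make. It assumes that the uevent properties reach the helper as environment variables and that "systemctl hibernate" is the desired action. The function name and the injectable `run` parameter are hypothetical.

```python
import subprocess
from typing import Callable, Mapping, Sequence

# Assumed action; the text names "systemctl hibernate" as one user space wrapper.
HIBERNATE_CMD = ["systemctl", "hibernate"]


def handle_uevent(
    env: Mapping[str, str],
    run: Callable[[Sequence[str]], object] = subprocess.run,
) -> bool:
    """Start hibernation if the uevent carries EVENT=hibernate."""
    if env.get("EVENT") != "hibernate":
        return False  # some other uevent; nothing to do
    run(HIBERNATE_CMD)
    return True
```

A helper script would call `handle_uevent(os.environ)` after importing `os`. A real rule would probably also match on the subsystem and driver, which this sketch leaves out.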
+
+Handling VMBus Devices During Hibernation & Resume
+--------------------------------------------------
+The VMBus bus driver, and the individual VMBus device drivers,
+implement suspend and resume functions that are called as part of the
+Linux orchestration of hibernation and of resuming from hibernation.
+The overall approach is to leave in place the data structures for the
+primary VMBus channels and their associated Linux devices, such as
+SCSI controllers and others, so that they are captured in the
+hibernation image. This approach allows any state associated with the
+device to be persisted across the hibernation/resume. When the VM
+resumes, the devices are re-offered by Hyper-V and are connected to
+the data structures that already exist in the resumed hibernation
+image.
+
+VMBus devices are identified by class and instance GUID. (See section
+"VMBus device creation/deletion" in
+Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation,
+the resume functions expect that the devices offered by Hyper-V have
+the same class/instance GUIDs as the devices present at the time of
+hibernation. Having the same class/instance GUIDs allows the offered
+devices to be matched to the primary VMBus channel data structures in
+the memory of the now resumed hibernation image. If any devices are
+offered that don't match primary VMBus channel data structures that
+already exist, they are processed normally as newly added devices. If
+primary VMBus channels that exist in the resumed hibernation image are
+not matched with a device offered in the resumed VM, the resume
+sequence waits for 10 seconds, then proceeds. But the unmatched device
+is likely to cause errors in the resumed VM.
+
+When resuming existing primary VMBus channels, the newly offered
+relids might be different because relids can change on each VM boot,
+even if the VM configuration hasn't changed. The VMBus bus driver
+resume function matches the class/instance GUIDs, and updates the
+relids in case they have changed.
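
Reviewer note: the matching rules in the two paragraphs above can be summarised with a small model. This is not the VMBus driver code. The `Channel` type and the `apply_offers` function are hypothetical, and GUIDs are plain strings here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Channel:
    class_guid: str
    instance_guid: str
    relid: int


def apply_offers(
    existing: List[Channel], offers: List[Channel]
) -> Tuple[List[Channel], List[Channel]]:
    """Match re-offered devices to channels kept in the hibernation image.

    Returns (new_devices, unmatched_existing).
    """
    by_guid = {(c.class_guid, c.instance_guid): c for c in existing}
    matched = set()
    new_devices = []
    for offer in offers:
        key = (offer.class_guid, offer.instance_guid)
        channel = by_guid.get(key)
        if channel is None:
            # No channel with these GUIDs: treat as a newly added device.
            new_devices.append(offer)
        else:
            # Same device as before; only the relid may have changed.
            channel.relid = offer.relid
            matched.add(key)
    # Channels in the image that were never re-offered.
    unmatched = [c for key, c in by_guid.items() if key not in matched]
    return new_devices, unmatched
```

In this model the `unmatched` list corresponds to the case where the resume sequence waits 10 seconds and then proceeds, with errors likely in the resumed VM.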
+
+VMBus sub-channels are not persisted in the hibernation image. Each
+VMBus device driver's suspend function must close any sub-channels
+prior to hibernation. Closing a sub-channel causes Hyper-V to send a
+RESCIND_CHANNELOFFER message, which Linux processes by freeing the
+channel data structures so that all vestiges of the sub-channel are
+removed. By contrast, primary channels are marked closed and their
+ring buffers are freed, but Hyper-V does not send a rescind message,
+so the channel data structure continues to exist. Upon resume, the
+device driver's resume function re-allocates the ring buffer and
+re-opens the existing channel. It then communicates with Hyper-V to
+re-open sub-channels from scratch.
+
+The Linux ends of Hyper-V sockets are forced closed at the time of
+hibernation. The guest can't force closing the host end of the socket,
+but any host-side actions on the host end will produce an error.
+
+VMBus devices use the same suspend function for the "freeze" and the
+"poweroff" phases, and the same resume function for the "thaw" and
+"restore" phases. See the "Entering Hibernation" section of
+Documentation/driver-api/pm/devices.rst for the sequencing of the
+phases.
+
+Detailed Hibernation Sequence
+-----------------------------
+1. The Linux power management (PM) subsystem prepares for
+ hibernation by freezing user space processes and allocating
+ memory to hold the hibernation image.
+2. As part of the "freeze" phase, Linux PM calls the "suspend"
+ function for each VMBus device in turn. As described above, this
+ function removes sub-channels, and leaves the primary channel in
+ a closed state.
+3. Linux PM calls the "suspend" function for the VMBus bus, which
+ closes any Hyper-V socket channels and unloads the top-level
+ VMBus connection with the Hyper-V host.
+4. Linux PM disables non-boot CPUs, creates the hibernation image in
+ the previously allocated memory, then re-enables non-boot CPUs.
+ The hibernation image contains the memory data structures for the
+ closed primary channels, but no sub-channels.
+5. As part of the "thaw" phase, Linux PM calls the "resume" function
+ for the VMBus bus, which re-establishes the top-level VMBus
+ connection and requests that Hyper-V re-offer the VMBus devices.
+ As offers are received for the primary channels, the relids are
+ updated as previously described.
+6. Linux PM calls the "resume" function for each VMBus device. Each
+ device re-opens its primary channel, and communicates with Hyper-V
+ to re-establish sub-channels if appropriate. The sub-channels
+ are re-created as new channels since they were previously removed
+ entirely in Step 2.
+7. With VMBus devices now working again, Linux PM writes the
+ hibernation image from memory to disk.
+8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff"
+ phase. VMBus channels are closed and the top-level VMBus
+ connection is unloaded.
+9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state
+ S4. Hibernation is now complete.
+
+Detailed Resume Sequence
+------------------------
+1. The guest VM boots into a fresh Linux OS instance. During boot,
+ the top-level VMBus connection is established, and synthetic
+ devices are enabled. This happens via the normal paths that don't
+ involve hibernation.
+2. Linux PM hibernation code reads swap space to find and read
+ the hibernation image into memory. If there is no hibernation
+ image, then this boot becomes a normal boot.
+3. If this is a resume from hibernation, the "freeze" phase is used
+ to shut down VMBus devices and unload the top-level VMBus
+ connection in the running fresh OS instance, just like Steps 2
+ and 3 in the hibernation sequence.
+4. Linux PM disables non-boot CPUs, and transfers control to the
+ read-in hibernation image. In the now-running hibernation image,
+ non-boot CPUs are restarted.
+5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6
+ from the hibernation sequence. The top-level VMBus connection is
+ re-established, and offers are received and matched to primary
+ channels in the image. Relids are updated. VMBus device resume
+ functions re-open primary channels and re-create sub-channels.
+6. Linux PM exits the hibernation resume sequence and the VM is now
+ running normally from the hibernation image.
+
+Key-Value Pair (KVP) Pseudo-Device Anomalies
+--------------------------------------------
+The VMBus KVP device behaves differently from other pseudo-devices
+offered by Hyper-V. When the KVP primary channel is closed, Hyper-V
+sends a rescind message, which causes all vestiges of the device to be
+removed. But Hyper-V then re-offers the device, causing it to be newly
+re-created. The removal and re-creation occurs during the "freeze"
+phase of hibernation, so the hibernation image contains the re-created
+KVP device. Similar behavior occurs during the "freeze" phase of the
+resume sequence while still in the fresh OS instance. But in both
+cases, the top-level VMBus connection is subsequently unloaded, which
+causes the device to be discarded on the Hyper-V side. So no harm is
+done and everything still works.
+
+Virtual PCI devices
+-------------------
+Virtual PCI devices are physical PCI devices that are mapped directly
+into the VM's physical address space so the VM can interact directly
+with the hardware. vPCI devices include those accessed via what Hyper-V
+calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC
+Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst.
+
+Hyper-V DDA devices are offered to guest VMs after the top-level VMBus
+connection is established, just like VMBus synthetic devices. They are
+statically assigned to the VM, and their instance GUIDs don't change
+unless the Hyper-V administrator makes changes to the configuration.
+DDA devices are represented in Linux as virtual PCI devices that have
+a VMBus identity as well as a PCI identity. Consequently, Linux guest
+hibernation first handles DDA devices as VMBus devices in order to
+manage the VMBus channel. But then they are also handled as PCI
+devices using the hibernation functions implemented by their native
+PCI driver.
+
+SR-IOV NIC VFs also have a VMBus identity as well as a PCI
+identity, and overall are processed similarly to DDA devices. A
+difference is that VFs are not offered to the VM during initial boot
+of the VM. Instead, the VMBus synthetic NIC driver first starts
+operating and communicates to Hyper-V that it is prepared to accept a
+VF, and then the VF offer is made. However, the VMBus connection
+might later be unloaded and then re-established without the VM being
+rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation
+Sequence above and in the Detailed Resume Sequence. In such a case,
+the VFs likely became part of the VM during initial boot, so when the
+VMBus connection is re-established, the VFs are offered on the
+re-established connection without intervention by the synthetic NIC driver.
+
+UIO Devices
+-----------
+A VMBus device can be exposed to user space using the Hyper-V UIO
+driver (uio_hv_generic.c) so that a user space driver can control and
+operate the device. However, the VMBus UIO driver does not support the
+suspend and resume operations needed for hibernation. If a VMBus
+device is configured to use the UIO driver, hibernating the VM fails
+and Linux continues to run normally. The most common use of the Hyper-V
+UIO driver is for DPDK networking, but there are other uses as well.
+
+Resuming on a Different VM
+--------------------------
+This scenario occurs in the Azure public cloud in that a hibernated
+customer VM only exists as saved configuration and disks -- the VM no
+longer exists on any Hyper-V host. When the customer VM is resumed, a
+new Hyper-V VM with identical configuration is created, likely on a
+different Hyper-V host. That new Hyper-V VM becomes the resumed
+customer VM, and the steps the Linux kernel takes to resume from the
+hibernation image must work in that new VM.
+
+While the disks and their contents are preserved from the original VM,
+the Hyper-V-provided VMBus instance GUIDs of the disk controllers and
+other synthetic devices would typically be different. The difference
+would cause the resume from hibernation to fail, so several things are
+done to solve this problem:
+
+* For VMBus synthetic devices that support only a single instance,
+ Hyper-V always assigns the same instance GUIDs. For example, the
+ Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo
+ device, etc., always have the same instance GUID, both for local
+ Hyper-V installs as well as in the Azure cloud.
+
+* VMBus synthetic SCSI controllers may have multiple instances in a
+ VM, and in the general case instance GUIDs vary from VM to VM.
+ However, Azure VMs always have exactly two synthetic SCSI
+ controllers, and Azure code overrides the normal Hyper-V behavior
+ so these controllers are always assigned the same two instance
+ GUIDs. Consequently, when a customer VM is resumed on a newly
+ created VM, the instance GUIDs match. But this guarantee does not
+ hold for local Hyper-V installs.
+
+* Similarly, VMBus synthetic NICs may have multiple instances in a
+ VM, and the instance GUIDs vary from VM to VM. Again, Azure code
+ overrides the normal Hyper-V behavior so that the instance GUID
+ of a synthetic NIC in a customer VM does not change, even if the
+ customer VM is deallocated or hibernated, and then re-constituted
+ on a newly created VM. As with SCSI controllers, this behavior
+ does not hold for local Hyper-V installs.
+
+* vPCI devices do not have the same instance GUIDs when resuming
+ from hibernation on a newly created VM. Consequently, Azure does
+ not support hibernation for VMs that have DDA devices such as
+ NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the
+ VF from the VM before it hibernates so that the hibernation image
+ does not contain a VF device. When the VM is resumed it
+ instantiates a new VF, rather than trying to match against a VF
+ that is present in the hibernation image. Because Azure must
+ remove any VFs before initiating hibernation, Azure VM
+ hibernation must be initiated externally from the Azure Portal or
+ Azure CLI, which in turn uses the Shutdown integration service to
+ tell Linux to do the hibernation. If hibernation is self-initiated
+ within the Azure VM, VFs remain in the hibernation image, and are
+ not resumed properly.
+
+In summary, Azure takes special actions to remove VFs and to ensure
+that VMBus device instance GUIDs match on a new/different VM, allowing
+hibernation to work for most general-purpose Azure VM sizes. While
+similar special actions could be taken when resuming on a different VM
+on a local Hyper-V install, orchestrating such actions is not provided
+out-of-the-box by local Hyper-V and so requires custom scripting.
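
The instance-GUID matching that this section depends on can be pictured with a small model. The sketch below is illustrative only and is not kernel code: ``struct chan``, ``match_offer()`` and the flat 16-byte GUID are assumptions made for the example. It shows why an offer whose instance GUID differs from every channel saved in the hibernation image cannot be matched, and how the relid is updated when a match is found.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a primary channel saved in the hibernation image. */
struct chan {
    uint8_t  instance_guid[16]; /* identifies the device instance */
    uint32_t relid;             /* relid assigned by Hyper-V */
};

/*
 * Match a re-offered device against the saved channels by instance GUID.
 * On a match, adopt the relid carried by the new offer and return the
 * channel. Return NULL when no saved channel has that instance GUID.
 */
static struct chan *match_offer(struct chan *saved, size_t n,
                                const uint8_t guid[16], uint32_t new_relid)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(saved[i].instance_guid, guid, 16) == 0) {
            saved[i].relid = new_relid;
            return &saved[i];
        }
    }
    return NULL;
}
```

In this model, a NULL return corresponds to the case where the new VM presents a different instance GUID, which is the failure that the Azure behavior described above avoids.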
diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst
index de447e11b4a5..c84c40fd61c9 100644
--- a/Documentation/virt/hyperv/index.rst
+++ b/Documentation/virt/hyperv/index.rst
@@ -11,3 +11,5 @@ Hyper-V Enlightenments
vmbus
clocks
vpci
+ hibernation
+ coco
diff --git a/Documentation/virt/hyperv/overview.rst b/Documentation/virt/hyperv/overview.rst
index cd493332c88a..77408a89d1a4 100644
--- a/Documentation/virt/hyperv/overview.rst
+++ b/Documentation/virt/hyperv/overview.rst
@@ -40,7 +40,7 @@ Linux guests communicate with Hyper-V in four different ways:
arm64, these synthetic registers must be accessed using explicit
hypercalls.
-* VMbus: VMbus is a higher-level software construct that is built on
+* VMBus: VMBus is a higher-level software construct that is built on
the other 3 mechanisms. It is a message passing interface between
the Hyper-V host and the Linux guest. It uses memory that is shared
between Hyper-V and the guest, along with various signaling
@@ -54,8 +54,8 @@ x86/x64 architecture only.
.. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs
-VMbus is not documented. This documentation provides a high-level
-overview of VMbus and how it works, but the details can be discerned
+VMBus is not documented. This documentation provides a high-level
+overview of VMBus and how it works, but the details can be discerned
only from the code.
Sharing Memory
@@ -74,7 +74,7 @@ follows:
physical address space. How Hyper-V is told about the GPA or list
of GPAs varies. In some cases, a single GPA is written to a
synthetic register. In other cases, a GPA or list of GPAs is sent
- in a VMbus message.
+ in a VMBus message.
* Hyper-V translates the GPAs into "real" physical memory addresses,
and creates a virtual mapping that it can use to access the memory.
@@ -133,9 +133,9 @@ only the CPUs actually present in the VM, so Linux does not report
any hot-add CPUs.
A Linux guest CPU may be taken offline using the normal Linux
-mechanisms, provided no VMbus channel interrupts are assigned to
-the CPU. See the section on VMbus Interrupts for more details
-on how VMbus channel interrupts can be re-assigned to permit
+mechanisms, provided no VMBus channel interrupts are assigned to
+the CPU. See the section on VMBus Interrupts for more details
+on how VMBus channel interrupts can be re-assigned to permit
taking a CPU offline.
32-bit and 64-bit
@@ -169,14 +169,14 @@ and functionality. Hyper-V indicates feature/function availability
via flags in synthetic MSRs that Hyper-V provides to the guest,
and the guest code tests these flags.
-VMbus has its own protocol version that is negotiated during the
-initial VMbus connection from the guest to Hyper-V. This version
+VMBus has its own protocol version that is negotiated during the
+initial VMBus connection from the guest to Hyper-V. This version
number is also output to dmesg during boot. This version number
is checked in a few places in the code to determine if specific
functionality is present.
-Furthermore, each synthetic device on VMbus also has a protocol
-version that is separate from the VMbus protocol version. Device
+Furthermore, each synthetic device on VMBus also has a protocol
+version that is separate from the VMBus protocol version. Device
drivers for these synthetic devices typically negotiate the device
protocol version, and may test that protocol version to determine
if specific device functionality is present.
diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst
index d2012d9022c5..1dcef6a7fda3 100644
--- a/Documentation/virt/hyperv/vmbus.rst
+++ b/Documentation/virt/hyperv/vmbus.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-VMbus
+VMBus
=====
-VMbus is a software construct provided by Hyper-V to guest VMs. It
+VMBus is a software construct provided by Hyper-V to guest VMs. It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs. The control path is
used to offer synthetic devices to the guest VM and, in some cases,
@@ -12,9 +12,9 @@ and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.
-VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
-entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c)
-establishes the VMbus control path with the Hyper-V host, then
+VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
+entry in a running Linux guest. The VMBus driver (drivers/hv/vmbus_drv.c)
+establishes the VMBus control path with the Hyper-V host, then
registers itself as a Linux bus driver. It implements the standard
bus functions for adding and removing devices to/from the bus.
@@ -49,9 +49,9 @@ synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc". These drivers contain
functions with names like "storvsc_connect_to_vsp".
-VMbus channels
+VMBus channels
--------------
-An instance of a synthetic device uses VMbus channels to communicate
+An instance of a synthetic device uses VMBus channels to communicate
between the VSP and the VSC. Channels are bi-directional and used
for passing messages. Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
@@ -73,7 +73,7 @@ write indices and some control flags, followed by the memory for the
actual ring. The size of the ring is determined by the VSC in the
guest and is specific to each synthetic device. The list of GPAs
making up the ring is communicated to the Hyper-V host over the
-VMbus control path as a GPA Descriptor List (GPADL). See function
+VMBus control path as a GPA Descriptor List (GPADL). See function
vmbus_establish_gpadl().
Each ring buffer is mapped into contiguous Linux kernel virtual
@@ -102,10 +102,10 @@ resources. For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes. For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.
-VMbus messages
---------------
-All VMbus messages have a standard header that includes the message
-length, the offset of the message payload, some flags, and a
+VMBus channel messages
+----------------------
+All messages sent in a VMBus channel have a standard header that includes
+the message length, the offset of the message payload, some flags, and a
transactionID. The portion of the message after the header is
unique to each VSP/VSC pair.
@@ -137,7 +137,7 @@ control message contains a list of GPAs that describe the data
buffer. For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.
-Three functions exist to send VMbus messages:
+Three functions exist to send VMBus channel messages:
1. vmbus_sendpacket(): Control-only messages and messages with
embedded data -- no GPAs
@@ -154,20 +154,51 @@ Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages. With the introduction of processor
technologies that fully encrypt guest memory and that allow the
-guest to not trust the hypervisor (AMD SNP-SEV, Intel TDX), trusting
+guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption. The drivers for
-VMbus synthetic devices are being updated to fully validate any
+VMBus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
-messages from VMbus devices. To facilitate such validation,
+messages from VMBus devices. To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V. Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.
-VMbus interrupts
+Synthetic Interrupt Controller (synic)
+--------------------------------------
+Hyper-V provides each guest CPU with a synthetic interrupt controller
+that is used by VMBus for host-guest communication. While each synic
+defines 16 synthetic interrupts (SINT), Linux uses only one of the 16
+(VMBUS_MESSAGE_SINT). All interrupts related to communication between
+the Hyper-V host and a guest CPU use that SINT.
+
+The SINT is mapped to a single per-CPU architectural interrupt (i.e.,
+an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
+each CPU in the guest has a synic and may receive VMBus interrupts,
+they are best modeled in Linux as per-CPU interrupts. This model works
+well on arm64 where a single per-CPU Linux IRQ is allocated for
+VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
+"Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
+interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
+across all CPUs and explicitly coded to call vmbus_isr(). In this case,
+there's no Linux IRQ, and the interrupts are visible in aggregate in
+/proc/interrupts on the "HYP" line.
+
+The synic provides the means to demultiplex the architectural interrupt into
+one or more logical interrupts and route the logical interrupt to the proper
+VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and
+related functions that access synic data structures.
+
+The synic is not modeled in Linux as an irq chip or irq domain,
+and the demultiplexed logical interrupts are not Linux IRQs. As such,
+they don't appear in /proc/interrupts or /proc/irq. The CPU
+affinity for one of these logical interrupts is controlled via an
+entry under /sys/bus/vmbus as described below.
+
+VMBus interrupts
----------------
-VMbus provides a mechanism for the guest to interrupt the host when
+VMBus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer. The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty. If the guest sends
@@ -176,63 +207,55 @@ unnecessary. If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.
-Similarly, the host will interrupt the guest when it sends a new
-message on the VMbus control path, or when a VMbus channel "in" ring
-buffer transitions from empty to non-empty. Each CPU in the guest
-may receive VMbus interrupts, so they are best modeled as per-CPU
-interrupts in Linux. This model works well on arm64 where a single
-per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for
-per-CPU IRQs, an x86 interrupt vector is statically allocated (see
-HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
-call the VMbus interrupt service routine. These interrupts are
-visible in /proc/interrupts on the "HYP" line.
-
-The guest CPU that a VMbus channel will interrupt is selected by the
+Similarly, the host will interrupt the guest via the synic when
+it sends a new message on the VMBus control path, or when a VMBus
+channel "in" ring buffer transitions from empty to non-empty due to
+the host inserting a new VMBus channel message. The control message stream
+and each VMBus channel "in" ring buffer are separate logical interrupts
+that are demultiplexed by vmbus_isr(). It demultiplexes by first checking
+for channel interrupts by calling vmbus_chan_sched(), which looks at a synic
+bitmap to determine which channels have pending interrupts on this CPU.
+If multiple channels have pending interrupts for this CPU, they are
+processed sequentially. When all channel interrupts have been processed,
+vmbus_isr() checks for and processes any messages received on the VMBus
+control path.
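
The ordering described above (channel interrupts first, control-path messages afterwards) can be sketched with a simplified model. This is not the code in vmbus_isr(); ``struct cpu_events``, ``demux_model()`` and the 64-channel bitmap are assumptions chosen to keep the example short.

```c
#include <stdint.h>

/* Hypothetical per-CPU view of pending work (a model, not the synic layout). */
struct cpu_events {
    uint64_t channel_bitmap; /* one pending bit per channel */
    int      control_msgs;   /* pending control-path messages */
};

/*
 * Record the order in which pending work would be handled: channel
 * indices in ascending order, then -1 for each control-path message.
 * Returns the number of entries written to order[].
 */
static int demux_model(struct cpu_events *ev, int *order, int cap)
{
    int n = 0;

    /* Step 1: channel interrupts, processed sequentially. */
    for (int ch = 0; ch < 64 && n < cap; ch++) {
        if (ev->channel_bitmap & (UINT64_C(1) << ch)) {
            ev->channel_bitmap &= ~(UINT64_C(1) << ch);
            order[n++] = ch;
        }
    }

    /* Step 2: control-path messages, only after the channels. */
    while (ev->control_msgs > 0 && n < cap) {
        ev->control_msgs--;
        order[n++] = -1;
    }
    return n;
}
```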
+
+The guest CPU that a VMBus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
-selection. VMbus devices are broadly grouped into two categories:
+selection. VMBus devices are broadly grouped into two categories:
-1. "Slow" devices that need only one VMbus channel. The devices
+1. "Slow" devices that need only one VMBus channel. The devices
(such as keyboard, mouse, heartbeat, and timesync) generate
- relatively few interrupts. Their VMbus channels are all
+ relatively few interrupts. Their VMBus channels are all
assigned to interrupt the VMBUS_CONNECT_CPU, which is always
CPU 0.
-2. "High speed" devices that may use multiple VMbus channels for
+2. "High speed" devices that may use multiple VMBus channels for
higher parallelism and performance. These devices include the
- synthetic SCSI controller and synthetic NIC. Their VMbus
+ synthetic SCSI controller and synthetic NIC. Their VMBus
channels interrupts are assigned to CPUs that are spread out
among the available CPUs in the VM so that interrupts on
multiple channels can be processed in parallel.
-The assignment of VMbus channel interrupts to CPUs is done in the
+The assignment of VMBus channel interrupts to CPUs is done in the
function init_vp_index(). This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.
-The CPU that a VMbus channel will interrupt can be seen in
+The CPU that a VMBus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/ channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
-by writing a new value to this sysfs entry. Because the interrupt
-assignment is done outside of the normal Linux affinity mechanism,
-there are no entries in /proc/irq corresponding to individual
-VMbus channel interrupts.
+by writing a new value to this sysfs entry. Because VMBus channel
+interrupts are not Linux IRQs, there are no entries in /proc/interrupts
+or /proc/irq corresponding to individual VMBus channel interrupts.
An online CPU in a Linux guest may not be taken offline if it has
-VMbus channel interrupts assigned to it. Any such channel
+VMBus channel interrupts assigned to it. Any such channel
interrupts must first be manually reassigned to another CPU as
described above. When no channel interrupts are assigned to the
CPU, it can be taken offline.
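
As a rough illustration of the sysfs entry described above, the following user-space sketch builds the path of a channel's ``cpu`` file and reads the CPU number from it. The helper names ``vmbus_cpu_path()`` and ``vmbus_read_cpu()`` are hypothetical, and the device GUID and relid passed in must come from a real guest.

```c
#include <stdio.h>

/*
 * Build /sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int vmbus_cpu_path(char *buf, size_t len,
                          const char *dev_guid, unsigned int relid)
{
    int n = snprintf(buf, len, "/sys/bus/vmbus/devices/%s/channels/%u/cpu",
                     dev_guid, relid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the CPU number from the sysfs file; returns -1 on any failure. */
static int vmbus_read_cpu(const char *path)
{
    FILE *f = fopen(path, "r");
    int cpu = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &cpu) != 1)
        cpu = -1;
    fclose(f);
    return cpu;
}
```

Reassigning the interrupt would be the matching write of a new CPU number to the same file, which succeeds only on Hyper-V versions that support it.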
-When a guest CPU receives a VMbus interrupt from the host, the
-function vmbus_isr() handles the interrupt. It first checks for
-channel interrupts by calling vmbus_chan_sched(), which looks at a
-bitmap setup by the host to determine which channels have pending
-interrupts on this CPU. If multiple channels have pending
-interrupts for this CPU, they are processed sequentially. When all
-channel interrupts have been processed, vmbus_isr() checks for and
-processes any message received on the VMbus control path.
-
-The VMbus channel interrupt handling code is designed to work
+The VMBus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel. Specifically, the code does not use
CPU-based exclusion for correctness. In normal operation, Hyper-V
@@ -242,23 +265,23 @@ when Hyper-V will make the transition. The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU. See comments in target_cpu_store().
-VMbus device creation/deletion
+VMBus device creation/deletion
------------------------------
Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
-path does not use a VMbus channel. See vmbus_post_msg() and
+path does not use a VMBus channel. See vmbus_post_msg() and
vmbus_on_msg_dpc().
The first step is for the guest to connect to the generic
-Hyper-V VMbus mechanism. As part of establishing this connection,
-the guest and Hyper-V agree on a VMbus protocol version they will
+Hyper-V VMBus mechanism. As part of establishing this connection,
+the guest and Hyper-V agree on a VMBus protocol version they will
use. This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.
The guest then tells Hyper-V to "send offers". Hyper-V sends an
offer message to the guest for each synthetic device that the VM
-is configured to have. Each VMbus device type has a fixed GUID
-known as the "class ID", and each VMbus device instance is also
+is configured to have. Each VMBus device type has a fixed GUID
+known as the "class ID", and each VMBus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
@@ -275,7 +298,7 @@ type based on the class ID, and invokes the correct driver to set up
the device. Driver/device matching is performed using the standard
Linux mechanism.
-The device driver probe function opens the primary VMbus channel to
+The device driver probe function opens the primary VMBus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares the ring buffer with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory. See
@@ -285,7 +308,7 @@ Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel. These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host. The setup messages may also
-include creating additional VMbus channels, which are somewhat
+include creating additional VMBus channels, which are somewhat
mis-named as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..2b52eb77e29c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7,8 +7,19 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation
1. General description
======================
-The kvm API is a set of ioctls that are issued to control various aspects
-of a virtual machine. The ioctls belong to the following classes:
+The kvm API is centered around different kinds of file descriptors
+and ioctls that can be issued to these file descriptors. An initial
+open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
+can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
+handle will create a VM file descriptor which can be used to issue VM
+ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
+create a virtual cpu or device and return a file descriptor pointing to
+the new resource.
+
+In other words, the kvm API is a set of ioctls that are issued to
+different kinds of file descriptor in order to control various aspects of
+a virtual machine. Depending on the file descriptor that accepts them,
+ioctls belong to the following classes:
- System ioctls: These query and set global attributes which affect the
whole kvm subsystem. In addition a system ioctl is used to create
@@ -35,18 +46,19 @@ of a virtual machine. The ioctls belong to the following classes:
device ioctls must be issued from the same process (address space) that
was used to create the VM.
-2. File descriptors
-===================
+While most ioctls are specific to one kind of file descriptor, in some
+cases the same ioctl can belong to more than one class.
+
+The KVM API grew over time. For this reason, KVM defines many constants
+of the form ``KVM_CAP_*``, each corresponding to a set of functionality
+provided by one or more ioctls. Availability of these "capabilities" can
+be checked with :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`. Some
+capabilities also need to be enabled for VMs or VCPUs where their
+functionality is desired (see :ref:`cap_enable` and :ref:`cap_enable_vm`).
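
A minimal sketch of a capability check follows, assuming a Linux host with the kernel UAPI headers installed. The helper name ``kvm_check_cap()`` is hypothetical; the ioctls and ``KVM_API_VERSION`` come from ``<linux/kvm.h>``. It opens the system file descriptor, confirms the API version, and then issues KVM_CHECK_EXTENSION.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the KVM_CHECK_EXTENSION result for cap (0 means unsupported),
 * or -1 if /dev/kvm cannot be opened or reports an unexpected API version.
 */
static int kvm_check_cap(long cap)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int ret = -1;

    if (kvm < 0)
        return -1;

    /* System ioctls are issued on the /dev/kvm file descriptor. */
    if (ioctl(kvm, KVM_GET_API_VERSION, 0) == KVM_API_VERSION)
        ret = ioctl(kvm, KVM_CHECK_EXTENSION, cap);

    close(kvm);
    return ret;
}
```

For example, ``kvm_check_cap(KVM_CAP_USER_MEMORY)`` returns a positive value on a host where that capability is available, and -1 where /dev/kvm is absent.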
-The kvm API is centered around file descriptors. An initial
-open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
-can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
-handle will create a VM file descriptor which can be used to issue VM
-ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
-create a virtual cpu or device and return a file descriptor pointing to
-the new resource. Finally, ioctls on a vcpu or device fd can be used
-to control the vcpu or device. For vcpus, this includes the important
-task of actually running guest code.
+
+2. Restrictions
+===============
In general file descriptors can be migrated among processes by means
of fork() and the SCM_RIGHTS facility of unix domain socket. These
@@ -96,12 +108,9 @@ description:
Capability:
which KVM extension provides this ioctl. Can be 'basic',
which means that is will be provided by any kernel that supports
- API version 12 (see section 4.1), a KVM_CAP_xyz constant, which
- means availability needs to be checked with KVM_CHECK_EXTENSION
- (see section 4.4), or 'none' which means that while not all kernels
- support this ioctl, there's no capability bit to check its
- availability: for kernels that don't support the ioctl,
- the ioctl returns -ENOTTY.
+ API version 12 (see :ref:`KVM_GET_API_VERSION <KVM_GET_API_VERSION>`),
+ or a KVM_CAP_xyz constant that can be checked with
+ :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`.
Architectures:
which instruction set architectures provide this ioctl.
@@ -118,6 +127,8 @@ description:
are not detailed, but errors with specific meanings are.
+.. _KVM_GET_API_VERSION:
+
4.1 KVM_GET_API_VERSION
-----------------------
@@ -246,6 +257,8 @@ This list also varies by kvm version and host processor, but does not change
otherwise.
+.. _KVM_CHECK_EXTENSION:
+
4.4 KVM_CHECK_EXTENSION
-----------------------
@@ -288,7 +301,7 @@ the VCPU file descriptor can be mmap-ed, including:
- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
- KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+ KVM_CAP_DIRTY_LOG_RING, see :ref:`KVM_CAP_DIRTY_LOG_RING`.
4.7 KVM_CREATE_VCPU
@@ -338,8 +351,8 @@ KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual
cpu's hardware control block.
-4.8 KVM_GET_DIRTY_LOG (vm ioctl)
---------------------------------
+4.8 KVM_GET_DIRTY_LOG
+---------------------
:Capability: basic
:Architectures: all
@@ -891,12 +904,12 @@ like this::
The irq_type field has the following values:
-- irq_type[0]:
+- KVM_ARM_IRQ_TYPE_CPU:
out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ
-- irq_type[1]:
+- KVM_ARM_IRQ_TYPE_SPI:
in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.)
(the vcpu_index field is ignored)
-- irq_type[2]:
+- KVM_ARM_IRQ_TYPE_PPI:
in-kernel GIC: PPI, irq_id between 16 and 31 (incl.)
(The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs)
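
To illustrate how these types combine with the other fields, here is a hedged sketch that packs the ``irq`` value for KVM_IRQ_LINE on arm64. The constant and function names are local to the example; the shift values (irq_type in bits 27-24, vcpu_index in bits 23-16, irq_id in bits 15-0) are restated from the field layout and should be checked against the arm64 UAPI header before use.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the irq_type values, so the sketch builds on any host. */
enum { IRQ_TYPE_CPU = 0, IRQ_TYPE_SPI = 1, IRQ_TYPE_PPI = 2 };

#define IRQ_TYPE_SHIFT 24 /* irq_type occupies bits 27-24 */
#define IRQ_VCPU_SHIFT 16 /* vcpu_index occupies bits 23-16 */

/* Pack irq_type, vcpu_index and irq_id, enforcing the documented ranges. */
static uint32_t arm_irq_encode(unsigned int type, unsigned int vcpu_index,
                               unsigned int irq_id)
{
    assert(vcpu_index < 256);

    switch (type) {
    case IRQ_TYPE_CPU:
        assert(irq_id <= 1);                   /* 0 is IRQ, 1 is FIQ */
        break;
    case IRQ_TYPE_SPI:
        assert(irq_id >= 32 && irq_id <= 1019);
        vcpu_index = 0;                        /* ignored for SPIs */
        break;
    case IRQ_TYPE_PPI:
        assert(irq_id >= 16 && irq_id <= 31);
        break;
    default:
        assert(0 && "unknown irq_type");
    }

    return ((uint32_t)type << IRQ_TYPE_SHIFT) |
           ((uint32_t)vcpu_index << IRQ_VCPU_SHIFT) |
           (uint32_t)irq_id;
}
```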
@@ -1298,7 +1311,7 @@ See KVM_GET_VCPU_EVENTS for the data structure.
:Capability: KVM_CAP_DEBUGREGS
:Architectures: x86
-:Type: vm ioctl
+:Type: vcpu ioctl
:Parameters: struct kvm_debugregs (out)
:Returns: 0 on success, -1 on error
@@ -1320,7 +1333,7 @@ Reads debug registers from the vcpu.
:Capability: KVM_CAP_DEBUGREGS
:Architectures: x86
-:Type: vm ioctl
+:Type: vcpu ioctl
:Parameters: struct kvm_debugregs (in)
:Returns: 0 on success, -1 on error
@@ -1403,6 +1416,12 @@ Instead, an abort (data abort if the cause of the page-table update
was a load or a store, instruction abort if it was an instruction
fetch) is injected in the guest.
+S390:
+^^^^^
+
+Returns -EINVAL or -EEXIST if the VM has the KVM_VM_S390_UCONTROL flag set.
+Returns -EINVAL if called on a protected VM.
+
4.36 KVM_SET_TSS_ADDR
---------------------
@@ -1423,6 +1442,8 @@ because of a quirk in the virtualization implementation (see the internals
documentation when it pops into existence).
+.. _KVM_ENABLE_CAP:
+
4.37 KVM_ENABLE_CAP
-------------------
@@ -1804,15 +1825,18 @@ emulate them efficiently. The fields in each entry are defined as follows:
the values returned by the cpuid instruction for
this function/index combination
-The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always returned
-as false, since the feature depends on KVM_CREATE_IRQCHIP for local APIC
-support. Instead it is reported via::
+x2APIC (CPUID leaf 1, ecx[21]) and TSC deadline timer (CPUID leaf 1, ecx[24])
+may be returned as true, but they depend on KVM_CREATE_IRQCHIP for in-kernel
+emulation of the local APIC. TSC deadline timer support is also reported via::
ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER)
if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
+Enabling x2APIC in KVM_SET_CPUID2 requires KVM_CREATE_IRQCHIP as KVM doesn't
+support forwarding x2APIC MSR accesses to userspace, i.e. KVM does not support
+emulating x2APIC in userspace.
4.47 KVM_PPC_GET_PVINFO
-----------------------
@@ -1893,6 +1917,9 @@ No flags are specified so far, the corresponding field must be set to zero.
#define KVM_IRQ_ROUTING_HV_SINT 4
#define KVM_IRQ_ROUTING_XEN_EVTCHN 5
+On s390, adding a KVM_IRQ_ROUTING_S390_ADAPTER is rejected on ucontrol VMs with
+error -EINVAL.
+
flags:
- KVM_MSI_VALID_DEVID: used along with KVM_IRQ_ROUTING_MSI routing entry
@@ -1921,7 +1948,7 @@ flags:
If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier
for the device that wrote the MSI message. For PCI, this is usually a
-BFD identifier in the lower 16 bits.
+BDF identifier in the lower 16 bits.
On x86, address_hi is ignored unless the KVM_X2APIC_API_USE_32BIT_IDS
feature of KVM_CAP_X2APIC_API capability is enabled. If it is enabled,
@@ -2110,8 +2137,8 @@ TLB, prior to calling KVM_RUN on the associated vcpu.
The "bitmap" field is the userspace address of an array. This array
consists of a number of bits, equal to the total number of TLB entries as
-determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
-nearest multiple of 64.
+determined by the last successful call to ``KVM_ENABLE_CAP(KVM_CAP_SW_TLB)``,
+rounded up to the nearest multiple of 64.
Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
array.
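
As a rough illustration of the sizing rule above, the following C sketch
computes how many bits and bytes userspace would need for the bitmap. The
helper names are hypothetical and not part of the KVM API; only the
round-up-to-64 rule comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a KVM API): number of bits in the bitmap for a
 * given number of TLB entries, rounded up to the nearest multiple of 64.
 */
static size_t tlb_bitmap_bits(size_t num_entries)
{
	assert(num_entries <= SIZE_MAX - 63);	/* guard the round-up against overflow */
	return (num_entries + 63) / 64 * 64;
}

/* Hypothetical helper: size in bytes of the array userspace would allocate. */
static size_t tlb_bitmap_bytes(size_t num_entries)
{
	return tlb_bitmap_bits(num_entries) / 8;
}
```

For example, 100 TLB entries round up to 128 bits, i.e. a 16-byte array.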
@@ -2164,42 +2191,6 @@ userspace update the TCE table directly which is useful in some
circumstances.
-4.63 KVM_ALLOCATE_RMA
----------------------
-
-:Capability: KVM_CAP_PPC_RMA
-:Architectures: powerpc
-:Type: vm ioctl
-:Parameters: struct kvm_allocate_rma (out)
-:Returns: file descriptor for mapping the allocated RMA
-
-This allocates a Real Mode Area (RMA) from the pool allocated at boot
-time by the kernel. An RMA is a physically-contiguous, aligned region
-of memory used on older POWER processors to provide the memory which
-will be accessed by real-mode (MMU off) accesses in a KVM guest.
-POWER processors support a set of sizes for the RMA that usually
-includes 64MB, 128MB, 256MB and some larger powers of two.
-
-::
-
- /* for KVM_ALLOCATE_RMA */
- struct kvm_allocate_rma {
- __u64 rma_size;
- };
-
-The return value is a file descriptor which can be passed to mmap(2)
-to map the allocated RMA into userspace. The mapped area can then be
-passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the
-RMA for a virtual machine. The size of the RMA in bytes (which is
-fixed at host kernel boot time) is returned in the rma_size field of
-the argument structure.
-
-The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl
-is supported; 2 if the processor requires all virtual machines to have
-an RMA, or 1 if the processor can use an RMA but doesn't require it,
-because it supports the Virtual RMA (VRMA) facility.
-
-
4.64 KVM_NMI
------------
@@ -2439,8 +2430,11 @@ registers, find a list below:
PPC KVM_REG_PPC_PSSCR 64
PPC KVM_REG_PPC_DEC_EXPIRY 64
PPC KVM_REG_PPC_PTCR 64
+ PPC KVM_REG_PPC_HASHKEYR 64
+ PPC KVM_REG_PPC_HASHPKEYR 64
PPC KVM_REG_PPC_DAWR1 64
PPC KVM_REG_PPC_DAWRX1 64
+ PPC KVM_REG_PPC_DEXCR 64
PPC KVM_REG_PPC_TM_GPR0 64
...
PPC KVM_REG_PPC_TM_GPR31 64
@@ -2583,7 +2577,7 @@ Specifically:
0x6030 0000 0010 004a SPSR_ABT 64 spsr[KVM_SPSR_ABT]
0x6030 0000 0010 004c SPSR_UND 64 spsr[KVM_SPSR_UND]
0x6030 0000 0010 004e SPSR_IRQ 64 spsr[KVM_SPSR_IRQ]
- 0x6060 0000 0010 0050 SPSR_FIQ 64 spsr[KVM_SPSR_FIQ]
+ 0x6030 0000 0010 0050 SPSR_FIQ 64 spsr[KVM_SPSR_FIQ]
0x6040 0000 0010 0054 V0 128 fp_regs.vregs[0] [1]_
0x6040 0000 0010 0058 V1 128 fp_regs.vregs[1] [1]_
...
@@ -2593,7 +2587,7 @@ Specifically:
======================= ========= ===== =======================================
.. [1] These encodings are not accepted for SVE-enabled vcpus. See
- KVM_ARM_VCPU_INIT.
+ :ref:`KVM_ARM_VCPU_INIT`.
The equivalent register content can be accessed via bits [127:0] of
the corresponding SVE Zn registers instead for vcpus that have SVE
@@ -2986,7 +2980,7 @@ flags:
If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier
for the device that wrote the MSI message. For PCI, this is usually a
-BFD identifier in the lower 16 bits.
+BDF identifier in the lower 16 bits.
On x86, address_hi is ignored unless the KVM_X2APIC_API_USE_32BIT_IDS
feature of KVM_CAP_X2APIC_API capability is enabled. If it is enabled,
@@ -3584,6 +3578,27 @@ Errors:
This ioctl returns the guest registers that are supported for the
KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
+Note that s390 does not support KVM_GET_REG_LIST for historical reasons
+(read: nobody cared). The set of registers in kernels 4.x and newer is:
+
+- KVM_REG_S390_TODPR
+
+- KVM_REG_S390_EPOCHDIFF
+
+- KVM_REG_S390_CPU_TIMER
+
+- KVM_REG_S390_CLOCK_COMP
+
+- KVM_REG_S390_PFTOKEN
+
+- KVM_REG_S390_PFCOMPARE
+
+- KVM_REG_S390_PFSELECT
+
+- KVM_REG_S390_PP
+
+- KVM_REG_S390_GBEA
+
4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated)
-----------------------------------------
@@ -4205,7 +4220,9 @@ whether or not KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER is
enabled. If KVM_MSR_EXIT_REASON_FILTER is enabled, KVM will exit to userspace
on denied accesses, i.e. userspace effectively intercepts the MSR access. If
KVM_MSR_EXIT_REASON_FILTER is not enabled, KVM will inject a #GP into the guest
-on denied accesses.
+on denied accesses. Note, if an MSR access is denied during emulation of MSR
+load/stores during VMX transitions, KVM ignores KVM_MSR_EXIT_REASON_FILTER.
+See the below warning for full details.
If an MSR access is allowed by userspace, KVM will emulate and/or virtualize
the access in accordance with the vCPU model. Note, KVM may still ultimately
@@ -4220,9 +4237,22 @@ filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
an error.
.. warning::
- MSR accesses as part of nested VM-Enter/VM-Exit are not filtered.
- This includes both writes to individual VMCS fields and reads/writes
- through the MSR lists pointed to by the VMCS.
+ MSR accesses that are side effects of instruction execution (emulated or
+ native) are not filtered as hardware does not honor MSR bitmaps outside of
+ RDMSR and WRMSR, and KVM mimics that behavior when emulating instructions
+ to avoid pointless divergence from hardware. E.g. RDPID reads MSR_TSC_AUX,
+ SYSENTER reads the SYSENTER MSRs, etc.
+
+ MSRs that are loaded/stored via dedicated VMCS fields are not filtered as
+ part of VM-Enter/VM-Exit emulation.
+
+ MSRs that are loaded/stored via VMX's load/store lists _are_ filtered as part
+ of VM-Enter/VM-Exit emulation. If an MSR access is denied on VM-Enter, KVM
+ synthesizes a consistency check VM-Exit(EXIT_REASON_MSR_LOAD_FAIL). If an
+ MSR access is denied on VM-Exit, KVM synthesizes a VM-Abort. In short, KVM
+ extends Intel's architectural list of MSRs that cannot be loaded/saved via
+ the VM-Enter/VM-Exit MSR list. It is the platform owner's responsibility
+ to communicate any such restrictions to their end users.
x2APIC MSR accesses cannot be filtered (KVM silently ignores filters that
cover any x2APIC MSRs).
@@ -4300,7 +4330,7 @@ operating system that uses the PIT for timing (e.g. Linux 2.4.x).
4.100 KVM_PPC_CONFIGURE_V3_MMU
------------------------------
-:Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3
+:Capability: KVM_CAP_PPC_MMU_RADIX or KVM_CAP_PPC_MMU_HASH_V3
:Architectures: ppc
:Type: vm ioctl
:Parameters: struct kvm_ppc_mmuv3_cfg (in)
@@ -4334,7 +4364,7 @@ the Power ISA V3.00, Book III section 5.7.6.1.
4.101 KVM_PPC_GET_RMMU_INFO
---------------------------
-:Capability: KVM_CAP_PPC_RADIX_MMU
+:Capability: KVM_CAP_PPC_MMU_RADIX
:Architectures: ppc
:Type: vm ioctl
:Parameters: struct kvm_ppc_rmmu_info (out)
@@ -4932,8 +4962,8 @@ Coalesced pio is based on coalesced mmio. There is little difference
between coalesced mmio and pio except that coalesced pio records accesses
to I/O ports.
-4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)
-------------------------------------
+4.117 KVM_CLEAR_DIRTY_LOG
+-------------------------
:Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
:Architectures: x86, arm64, mips
@@ -5069,8 +5099,8 @@ Recognised values for feature:
Finalizes the configuration of the specified vcpu feature.
The vcpu must already have been initialised, enabling the affected feature, by
-means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in
-features[].
+means of a successful :ref:`KVM_ARM_VCPU_INIT <KVM_ARM_VCPU_INIT>` call with the
+appropriate flag set in features[].
For affected vcpu features, this is a mandatory step that must be performed
before the vcpu is fully usable.
@@ -5242,7 +5272,7 @@ the cpu reset definition in the POP (Principles Of Operation).
4.123 KVM_S390_INITIAL_RESET
----------------------------
-:Capability: none
+:Capability: basic
:Architectures: s390
:Type: vcpu ioctl
:Parameters: none
@@ -5550,7 +5580,7 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA
in guest physical address space. This attribute should be used in
preference to KVM_XEN_ATTR_TYPE_SHARED_INFO as it avoids
unnecessary invalidation of an internal cache when the page is
- re-mapped in guest physcial address space.
+ re-mapped in guest physical address space.
Setting the hva to zero will disable the shared_info page.
@@ -6181,7 +6211,7 @@ applied.
.. _KVM_ARM_GET_REG_WRITABLE_MASKS:
4.139 KVM_ARM_GET_REG_WRITABLE_MASKS
--------------------------------------------
+------------------------------------
:Capability: KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES
:Architectures: arm64
@@ -6273,6 +6303,12 @@ state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
is '0' for all gfns. Userspace can control whether memory is shared/private by
toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
+S390:
+^^^^^
+
+Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set.
+Returns -EINVAL if called on a protected VM.
+
4.141 KVM_SET_MEMORY_ATTRIBUTES
-------------------------------
@@ -6316,7 +6352,7 @@ The "flags" field is reserved for future extensions and must be '0'.
:Architectures: none
:Type: vm ioctl
:Parameters: struct kvm_create_guest_memfd(in)
-:Returns: 0 on success, <0 on error
+:Returns: A file descriptor on success, <0 on error
KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
that refers to it. guest_memfd files are roughly analogous to files created
@@ -6352,6 +6388,69 @@ a single guest_memfd file, but the bound ranges must not overlap).
See KVM_SET_USER_MEMORY_REGION2 for additional details.
+4.143 KVM_PRE_FAULT_MEMORY
+---------------------------
+
+:Capability: KVM_CAP_PRE_FAULT_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_pre_fault_memory (in/out)
+:Returns: 0 if at least one page is processed, < 0 on error
+
+Errors:
+
+ ========== ===============================================================
+ EINVAL The specified `gpa` and `size` were invalid (e.g. not
+ page aligned, causes an overflow, or size is zero).
+ ENOENT The specified `gpa` is outside defined memslots.
+ EINTR An unmasked signal is pending and no page was processed.
+ EFAULT The parameter address was invalid.
+ EOPNOTSUPP Mapping memory for a GPA is unsupported by the
+ hypervisor, and/or for the current vCPU state/mode.
+ EIO unexpected error conditions (also causes a WARN)
+ ========== ===============================================================
+
+::
+
+ struct kvm_pre_fault_memory {
+ /* in/out */
+ __u64 gpa;
+ __u64 size;
+ /* in */
+ __u64 flags;
+ __u64 padding[5];
+ };
+
+KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
+for the current vCPU state. KVM maps memory as if the vCPU generated a
+stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
+CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.
+
+In the case of confidential VM types where there is an initial set up of
+private guest memory before the guest is 'finalized'/measured, this ioctl
+should only be issued after completing all the necessary setup to put the
+guest into a 'finalized' state so that the above semantics can be reliably
+ensured.
+
+In some cases, multiple vCPUs might share the page tables. In this
+case, the ioctl can be called in parallel.
+
+When the ioctl returns, the input values are updated to point to the
+remaining range. If `size` > 0 on return, the caller can just issue
+the ioctl again with the same `struct kvm_pre_fault_memory` argument.
+
+Shadow page tables cannot support this ioctl because they
+are indexed by virtual address or nested guest physical address.
+Calling this ioctl when the guest is using shadow page tables (for
+example because it is running a nested guest with nested page tables)
+will fail with `EOPNOTSUPP` even if `KVM_CHECK_EXTENSION` reports
+the capability to be present.
+
+`flags` must currently be zero.
+
+
+.. _kvm_run:
+
5. The kvm_run structure
========================
@@ -6416,9 +6515,12 @@ More architecture-specific flags detailing state of the VCPU that may
affect the device's behavior. Current defined flags::
/* x86, set if the VCPU is in system management mode */
- #define KVM_RUN_X86_SMM (1 << 0)
+ #define KVM_RUN_X86_SMM (1 << 0)
/* x86, set if bus lock detected in VM */
- #define KVM_RUN_BUS_LOCK (1 << 1)
+ #define KVM_RUN_X86_BUS_LOCK (1 << 1)
+ /* x86, set if the VCPU is executing a nested (L2) guest */
+ #define KVM_RUN_X86_GUEST_MODE (1 << 2)
+
/* arm64, set for KVM_EXIT_DEBUG */
#define KVM_DEBUG_ARCH_HSR_HIGH_VALID (1 << 0)
@@ -6761,6 +6863,10 @@ the first `ndata` items (possibly zero) of the data array are valid.
the guest issued a SYSTEM_RESET2 call according to v1.1 of the PSCI
specification.
+ - for arm64, data[0] is set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2
+ if the guest issued a SYSTEM_OFF2 call according to v1.3 of the PSCI
+ specification.
+
- for RISC-V, data[0] is set to the value of the second argument of the
``sbi_system_reset`` call.
@@ -6794,6 +6900,12 @@ either:
- Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2
"Caller responsibilities" for possible return values.
+Hibernation using the PSCI SYSTEM_OFF2 call is enabled when PSCI v1.3
+is enabled. If a guest invokes the PSCI SYSTEM_OFF2 function, KVM will
+exit to userspace with the KVM_SYSTEM_EVENT_SHUTDOWN event type and with
+data[0] set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2. The only
+supported hibernate type for the SYSTEM_OFF2 function is HIBERNATE_OFF.
+
::
/* KVM_EXIT_IOAPIC_EOI */
@@ -6894,6 +7006,13 @@ Note that KVM does not skip the faulting instruction as it does for
KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
if it decides to decode and emulate the instruction.
+This feature isn't available to protected VMs, as userspace does not
+have access to the state that is required to perform the emulation.
+Instead, a data abort exception is directly injected in the guest.
+Note that although KVM_CAP_ARM_NISV_TO_USER will be reported if
+queried outside of a protected VM context, the feature will not be
+exposed if queried on a protected VM file descriptor.
+
::
/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
@@ -7061,11 +7180,15 @@ primary storage for certain register types. Therefore, the kernel may use the
values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
+.. _cap_enable:
+
6. Capabilities that can be enabled on vCPUs
============================================
There are certain capabilities that change the behavior of the virtual CPU or
-the virtual machine when enabled. To enable them, please see section 4.37.
+the virtual machine when enabled. To enable them, please see
+:ref:`KVM_ENABLE_CAP`.
+
Below you can find a list of capabilities and what their effect on the vCPU or
the virtual machine is when enabling them.
@@ -7274,7 +7397,7 @@ KVM API and also from the guest.
sets are supported
(bitfields defined in arch/x86/include/uapi/asm/kvm.h).
-As described above in the kvm_sync_regs struct info in section 5 (kvm_run):
+As described above in the kvm_sync_regs struct info in section :ref:`kvm_run`,
KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers
without having to call SET/GET_*REGS". This reduces overhead by eliminating
repeated ioctl calls for setting and/or getting register values. This is
@@ -7320,13 +7443,15 @@ Unused bitfields in the bitarrays must be set to zero.
This capability connects the vcpu to an in-kernel XIVE device.
+.. _cap_enable_vm:
+
7. Capabilities that can be enabled on VMs
==========================================
There are certain capabilities that change the behavior of the virtual
-machine when enabled. To enable them, please see section 4.37. Below
-you can find a list of capabilities and what their effect on the VM
-is when enabling them.
+machine when enabled. To enable them, please see section
+:ref:`KVM_ENABLE_CAP`. Below you can find a list of capabilities and
+what their effect on the VM is when enabling them.
The following information is provided along with the description:
@@ -7551,6 +7676,7 @@ branch to guests' 0x200 interrupt vector.
:Architectures: x86
:Parameters: args[0] defines which exits are disabled
:Returns: 0 on success, -EINVAL when args[0] contains invalid exits
+ or if any vCPUs have already been created
Valid bits in args[0] are::
@@ -7757,29 +7883,31 @@ Valid bits in args[0] are::
#define KVM_BUS_LOCK_DETECTION_OFF (1 << 0)
#define KVM_BUS_LOCK_DETECTION_EXIT (1 << 1)
-Enabling this capability on a VM provides userspace with a way to select
-a policy to handle the bus locks detected in guest. Userspace can obtain
-the supported modes from the result of KVM_CHECK_EXTENSION and define it
-through the KVM_ENABLE_CAP.
+Enabling this capability on a VM provides userspace with a way to select a
+policy to handle the bus locks detected in guest. Userspace can obtain the
+supported modes from the result of KVM_CHECK_EXTENSION and define it through
+the KVM_ENABLE_CAP. The supported modes are mutually-exclusive.
-KVM_BUS_LOCK_DETECTION_OFF and KVM_BUS_LOCK_DETECTION_EXIT are supported
-currently and mutually exclusive with each other. More bits can be added in
-the future.
+This capability allows userspace to force VM exits on bus locks detected in the
+guest, irrespective of whether or not the host has enabled split-lock detection
+(which triggers an #AC exception that KVM intercepts). This capability is
+intended to mitigate attacks where a malicious/buggy guest can exploit bus
+locks to degrade the performance of the whole system.
-With KVM_BUS_LOCK_DETECTION_OFF set, bus locks in guest will not cause vm exits
-so that no additional actions are needed. This is the default mode.
+If KVM_BUS_LOCK_DETECTION_OFF is set, KVM doesn't force guest bus locks to VM
+exit, although the host kernel's split-lock #AC detection still applies, if
+enabled.
-With KVM_BUS_LOCK_DETECTION_EXIT set, vm exits happen when bus lock detected
-in VM. KVM just exits to userspace when handling them. Userspace can enforce
-its own throttling or other policy based mitigations.
+If KVM_BUS_LOCK_DETECTION_EXIT is set, KVM enables a CPU feature that ensures
+bus locks in the guest trigger a VM exit, and KVM exits to userspace for all
+such VM exits, e.g. to allow userspace to throttle the offending guest and/or
+apply some other policy-based mitigation. When exiting to userspace, KVM sets
+KVM_RUN_X86_BUS_LOCK in vcpu->run->flags, and conditionally sets the exit_reason
+to KVM_EXIT_X86_BUS_LOCK.
-This capability is aimed to address the thread that VM can exploit bus locks to
-degree the performance of the whole system. Once the userspace enable this
-capability and select the KVM_BUS_LOCK_DETECTION_EXIT mode, KVM will set the
-KVM_RUN_BUS_LOCK flag in vcpu-run->flags field and exit to userspace. Concerning
-the bus lock vm exit can be preempted by a higher priority VM exit, the exit
-notifications to userspace can be KVM_EXIT_BUS_LOCK or other reasons.
-KVM_RUN_BUS_LOCK flag is used to distinguish between them.
+Note! Detected bus locks may be coincident with other exits to userspace, i.e.
+KVM_RUN_X86_BUS_LOCK should be checked regardless of the primary exit reason if
+userspace wants to take action on all detected bus locks.
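
A minimal sketch of the check described in the note above, assuming the flag
value listed in the kvm_run section and a 16-bit flags field. The
`bus_lock_stats` structure and `note_bus_lock()` helper are hypothetical names
used only for illustration; real code should take the flag definition from
<linux/kvm.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value as listed in the kvm_run flags; the guard keeps this sketch standalone. */
#ifndef KVM_RUN_X86_BUS_LOCK
#define KVM_RUN_X86_BUS_LOCK (1 << 1)
#endif

/* Hypothetical per-vCPU bookkeeping, not part of the KVM API. */
struct bus_lock_stats {
	uint64_t detected;
};

/*
 * Intended to be called after every return from KVM_RUN, before dispatching
 * on exit_reason, so that bus locks coincident with other exits are counted.
 */
static bool note_bus_lock(uint16_t run_flags, struct bus_lock_stats *stats)
{
	if (!(run_flags & KVM_RUN_X86_BUS_LOCK))
		return false;
	stats->detected++;
	return true;
}
```

Userspace could then apply its throttling policy whenever `note_bus_lock()`
returns true, independent of which exit reason was reported.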
7.23 KVM_CAP_PPC_DAWR1
----------------------
@@ -7895,10 +8023,10 @@ perform a bulk copy of tags to/from the guest.
7.29 KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
-------------------------------------
-Architectures: x86 SEV enabled
-Type: vm
-Parameters: args[0] is the fd of the source vm
-Returns: 0 on success
+:Architectures: x86 SEV enabled
+:Type: vm
+:Parameters: args[0] is the fd of the source vm
+:Returns: 0 on success
This capability enables userspace to migrate the encryption context from the VM
indicated by the fd to the VM this is called on.
@@ -7946,7 +8074,11 @@ The valid bits in cap.args[0] are:
When this quirk is disabled, the reset value
is 0x10000 (APIC_LVT_MASKED).
- KVM_X86_QUIRK_CD_NW_CLEARED By default, KVM clears CR0.CD and CR0.NW.
+ KVM_X86_QUIRK_CD_NW_CLEARED By default, KVM clears CR0.CD and CR0.NW on
+ AMD CPUs to work around buggy guest firmware
+ that runs in perpetuity with CR0.CD, i.e.
+ with caches in "no fill" mode.
+
When this quirk is disabled, KVM does not
change the value of CR0.CD and CR0.NW.
@@ -7990,6 +8122,38 @@ KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
guest CPUID on writes to MISC_ENABLE if
KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
disabled.
+
+KVM_X86_QUIRK_SLOT_ZAP_ALL By default, for KVM_X86_DEFAULT_VM VMs, KVM
+ invalidates all SPTEs in all memslots and
+ address spaces when a memslot is deleted or
+ moved. When this quirk is disabled (or the
+ VM type isn't KVM_X86_DEFAULT_VM), KVM only
+ ensures the backing memory of the deleted
+ or moved memslot isn't reachable, i.e. KVM
+ _may_ invalidate only SPTEs related to the
+ memslot.
+
+KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the
+ vCPU's MSR_IA32_PERF_CAPABILITIES (0x345),
+ MSR_IA32_ARCH_CAPABILITIES (0x10a),
+ MSR_PLATFORM_INFO (0xce), and all VMX MSRs
+ (0x480..0x492) to the maximal capabilities
+ supported by KVM. KVM also sets
+ MSR_IA32_UCODE_REV (0x8b) to an arbitrary
+ value (which is different for Intel vs.
+ AMD). Lastly, when guest CPUID is set (by
+ userspace), KVM modifies select VMX MSR
+ fields to force consistency between guest
+ CPUID and L2's effective ISA. When this
+ quirk is disabled, KVM zeroes the vCPU's MSR
+ values (with two exceptions, see below),
+ i.e. treats the feature MSRs like CPUID
+ leaves and gives userspace full control of
+ the vCPU model definition. This quirk does
+ not affect VMX MSRs CR0/CR4_FIXED1 (0x487
+ and 0x489), as KVM does not allow them to
+ be set by userspace (KVM sets them based on
+ guest CPUID, for safety purposes).
=================================== ============================================
7.32 KVM_CAP_MAX_VCPU_ID
@@ -8063,6 +8227,37 @@ error/annotated fault.
See KVM_EXIT_MEMORY_FAULT for more information.
+7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS
+-----------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is the desired APIC bus clock rate, in nanoseconds
+:Returns: 0 on success, -EINVAL if args[0] contains an invalid value for the
+ frequency or if any vCPUs have been created, -ENXIO if a virtual
+ local APIC has not been created using KVM_CREATE_IRQCHIP.
+
+This capability sets the VM's APIC bus clock frequency, used by KVM's in-kernel
+virtual APIC when emulating APIC timers. KVM's default value can be retrieved
+by KVM_CHECK_EXTENSION.
+
+Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the
+core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.
+
+7.36 KVM_CAP_X86_GUEST_MODE
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will update the
+KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the
+vCPU was executing nested guest code when it exited.
+
+KVM exits with the register state of either the L1 or L2 guest
+depending on which executed at the time of an exit. Userspace must
+take care to differentiate between these cases.
+
8. Other capabilities.
======================
@@ -8095,7 +8290,7 @@ capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this
will disable the use of APIC hardware virtualization even if supported
by the CPU, as it's incompatible with SynIC auto-EOI behavior.
-8.3 KVM_CAP_PPC_RADIX_MMU
+8.3 KVM_CAP_PPC_MMU_RADIX
-------------------------
:Architectures: ppc
@@ -8105,7 +8300,7 @@ available, means that the kernel can support guests using the
radix MMU defined in Power ISA V3.00 (as implemented in the POWER9
processor).
-8.4 KVM_CAP_PPC_HASH_MMU_V3
+8.4 KVM_CAP_PPC_MMU_HASH_V3
---------------------------
:Architectures: ppc
@@ -8440,6 +8635,8 @@ guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.
+.. _KVM_CAP_DIRTY_LOG_RING:
+
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
----------------------------------------------------------
@@ -8819,6 +9016,8 @@ means the VM type with value @n is supported. Possible values of @n are::
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_SW_PROTECTED_VM 1
+ #define KVM_X86_SEV_VM 2
+ #define KVM_X86_SEV_ES_VM 3
Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing.
Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
diff --git a/Documentation/virt/kvm/arm/fw-pseudo-registers.rst b/Documentation/virt/kvm/arm/fw-pseudo-registers.rst
new file mode 100644
index 000000000000..b90fd0b0fa66
--- /dev/null
+++ b/Documentation/virt/kvm/arm/fw-pseudo-registers.rst
@@ -0,0 +1,138 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+ARM firmware pseudo-registers interface
+=======================================
+
+KVM handles the hypercall services as requested by the guests. New hypercall
+services are regularly made available by the ARM specification or by KVM (as
+vendor services) if they make sense from a virtualization point of view.
+
+This means that a guest booted on two different versions of KVM can observe
+two different "firmware" revisions. This could cause issues if a given guest
+is tied to a particular version of a hypercall service, or if a migration
+causes a different version to be exposed out of the blue to an unsuspecting
+guest.
+
+In order to remedy this situation, KVM exposes a set of "firmware
+pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
+interface. These registers can be saved/restored by userspace, and set
+to a convenient value as required.
+
+The following registers are defined:
+
+* KVM_REG_ARM_PSCI_VERSION:
+
+ KVM implements the PSCI (Power State Coordination Interface)
+ specification in order to provide services such as CPU on/off, reset
+ and power-off to the guest.
+
+ - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
+ (and thus has already been initialized)
+ - Returns the current PSCI version on GET_ONE_REG (defaulting to the
+ highest PSCI version implemented by KVM and compatible with v0.2)
+ - Allows any PSCI version implemented by KVM and compatible with
+ v0.2 to be set with SET_ONE_REG
+ - Affects the whole VM (even if the register view is per-vcpu)
+
+* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
+ Holds the state of the firmware support to mitigate CVE-2017-5715, as
+ offered by KVM to the guest via a HVC call. The workaround is described
+ under SMCCC_ARCH_WORKAROUND_1 in [1]_.
+
+ Accepted values are:
+
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL:
+ KVM does not offer
+ firmware support for the workaround. The mitigation status for the
+ guest is unknown.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL:
+ The workaround HVC call is
+ available to the guest and required for the mitigation.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED:
+ The workaround HVC call
+ is available to the guest, but it is not needed on this VCPU.
+
+* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
+ Holds the state of the firmware support to mitigate CVE-2018-3639, as
+ offered by KVM to the guest via a HVC call. The workaround is described
+ under SMCCC_ARCH_WORKAROUND_2 in [1]_.
+
+ Accepted values are:
+
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL:
+ A workaround is not
+ available. KVM does not offer firmware support for the workaround.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN:
+ The workaround state is
+ unknown. KVM does not offer firmware support for the workaround.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL:
+ The workaround is available,
+ and can be disabled by a vCPU. If
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
+ this vCPU.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED:
+ The workaround is always active on this vCPU or it is not needed.
+
+
+Bitmap Feature Firmware Registers
+---------------------------------
+
+Unlike the above registers, the following registers expose the
+hypercall services in the form of a feature-bitmap to userspace. This
+bitmap is translated to the services that are available to the guest.
+There is a register defined per service call owner, and each can be accessed
+via the GET/SET_ONE_REG interface.
+
+By default, these registers are set with the upper limit of the features
+that are supported. This way userspace can discover all the usable
+hypercall services via GET_ONE_REG. Userspace can then write the desired
+bitmap back via SET_ONE_REG. The features for the registers that
+are untouched, probably because userspace isn't aware of them, will be
+exposed as is to the guest.
+
+Note that KVM will no longer allow userspace to configure the registers
+once any of the vCPUs has run at least once. Instead, it will
+return -EBUSY.
+
+The pseudo-firmware bitmap registers are as follows:
+
+* KVM_REG_ARM_STD_BMAP:
+ Controls the bitmap of the ARM Standard Secure Service Calls.
+
+ The following bits are accepted:
+
+ Bit-0: KVM_REG_ARM_STD_BIT_TRNG_V1_0:
+ The bit represents the services offered under v1.0 of ARM True Random
+ Number Generator (TRNG) specification, ARM DEN0098.
+
+* KVM_REG_ARM_STD_HYP_BMAP:
+ Controls the bitmap of the ARM Standard Hypervisor Service Calls.
+
+ The following bits are accepted:
+
+ Bit-0: KVM_REG_ARM_STD_HYP_BIT_PV_TIME:
+ The bit represents the Paravirtualized Time service as represented by
+ ARM DEN0057A.
+
+* KVM_REG_ARM_VENDOR_HYP_BMAP:
+ Controls the bitmap of the Vendor specific Hypervisor Service Calls.
+
+ The following bits are accepted:
+
+ Bit-0: KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT:
+ The bit represents the ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID
+ and ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID function-ids.
+
+ Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
+ The bit represents the Precision Time Protocol KVM service.
+
+Errors:
+
+ ======= =============================================================
+ -ENOENT Unknown register accessed.
+ -EBUSY Attempted a 'write' to the register after the VM has started.
+ -EINVAL Invalid bitmap written to the register.
+ ======= =============================================================
+
+.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
diff --git a/Documentation/virt/kvm/arm/hypercalls.rst b/Documentation/virt/kvm/arm/hypercalls.rst
index 3e23084644ba..af7bc2c2e0cb 100644
--- a/Documentation/virt/kvm/arm/hypercalls.rst
+++ b/Documentation/virt/kvm/arm/hypercalls.rst
@@ -1,138 +1,144 @@
.. SPDX-License-Identifier: GPL-2.0
-=======================
-ARM Hypercall Interface
-=======================
-
-KVM handles the hypercall services as requested by the guests. New hypercall
-services are regularly made available by the ARM specification or by KVM (as
-vendor services) if they make sense from a virtualization point of view.
-
-This means that a guest booted on two different versions of KVM can observe
-two different "firmware" revisions. This could cause issues if a given guest
-is tied to a particular version of a hypercall service, or if a migration
-causes a different version to be exposed out of the blue to an unsuspecting
-guest.
-
-In order to remedy this situation, KVM exposes a set of "firmware
-pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
-interface. These registers can be saved/restored by userspace, and set
-to a convenient value as required.
-
-The following registers are defined:
-
-* KVM_REG_ARM_PSCI_VERSION:
-
- KVM implements the PSCI (Power State Coordination Interface)
- specification in order to provide services such as CPU on/off, reset
- and power-off to the guest.
-
- - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
- (and thus has already been initialized)
- - Returns the current PSCI version on GET_ONE_REG (defaulting to the
- highest PSCI version implemented by KVM and compatible with v0.2)
- - Allows any PSCI version implemented by KVM and compatible with
- v0.2 to be set with SET_ONE_REG
- - Affects the whole VM (even if the register view is per-vcpu)
-
-* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
- Holds the state of the firmware support to mitigate CVE-2017-5715, as
- offered by KVM to the guest via a HVC call. The workaround is described
- under SMCCC_ARCH_WORKAROUND_1 in [1].
-
- Accepted values are:
-
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL:
- KVM does not offer
- firmware support for the workaround. The mitigation status for the
- guest is unknown.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL:
- The workaround HVC call is
- available to the guest and required for the mitigation.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED:
- The workaround HVC call
- is available to the guest, but it is not needed on this VCPU.
-
-* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
- Holds the state of the firmware support to mitigate CVE-2018-3639, as
- offered by KVM to the guest via a HVC call. The workaround is described
- under SMCCC_ARCH_WORKAROUND_2 in [1]_.
-
- Accepted values are:
-
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL:
- A workaround is not
- available. KVM does not offer firmware support for the workaround.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN:
- The workaround state is
- unknown. KVM does not offer firmware support for the workaround.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL:
- The workaround is available,
- and can be disabled by a vCPU. If
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
- this vCPU.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED:
- The workaround is always active on this vCPU or it is not needed.
-
-
-Bitmap Feature Firmware Registers
----------------------------------
-
-Contrary to the above registers, the following registers exposes the
-hypercall services in the form of a feature-bitmap to the userspace. This
-bitmap is translated to the services that are available to the guest.
-There is a register defined per service call owner and can be accessed via
-GET/SET_ONE_REG interface.
-
-By default, these registers are set with the upper limit of the features
-that are supported. This way userspace can discover all the usable
-hypercall services via GET_ONE_REG. The user-space can write-back the
-desired bitmap back via SET_ONE_REG. The features for the registers that
-are untouched, probably because userspace isn't aware of them, will be
-exposed as is to the guest.
-
-Note that KVM will not allow the userspace to configure the registers
-anymore once any of the vCPUs has run at least once. Instead, it will
-return a -EBUSY.
-
-The pseudo-firmware bitmap register are as follows:
-
-* KVM_REG_ARM_STD_BMAP:
- Controls the bitmap of the ARM Standard Secure Service Calls.
-
- The following bits are accepted:
-
- Bit-0: KVM_REG_ARM_STD_BIT_TRNG_V1_0:
- The bit represents the services offered under v1.0 of ARM True Random
- Number Generator (TRNG) specification, ARM DEN0098.
-
-* KVM_REG_ARM_STD_HYP_BMAP:
- Controls the bitmap of the ARM Standard Hypervisor Service Calls.
-
- The following bits are accepted:
-
- Bit-0: KVM_REG_ARM_STD_HYP_BIT_PV_TIME:
- The bit represents the Paravirtualized Time service as represented by
- ARM DEN0057A.
-
-* KVM_REG_ARM_VENDOR_HYP_BMAP:
- Controls the bitmap of the Vendor specific Hypervisor Service Calls.
-
- The following bits are accepted:
-
- Bit-0: KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT
- The bit represents the ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID
- and ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID function-ids.
-
- Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
- The bit represents the Precision Time Protocol KVM service.
-
-Errors:
-
- ======= =============================================================
- -ENOENT Unknown register accessed.
- -EBUSY Attempt a 'write' to the register after the VM has started.
- -EINVAL Invalid bitmap written to the register.
- ======= =============================================================
-
-.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
+===============================================
+KVM/arm64-specific hypercalls exposed to guests
+===============================================
+
+This file documents the KVM/arm64-specific hypercalls which may be
+exposed by KVM/arm64 to guest operating systems. These hypercalls are
+issued using the HVC instruction according to version 1.1 of the Arm SMC
+Calling Convention (DEN0028/C):
+
+https://developer.arm.com/docs/den0028/c
+
+All KVM/arm64-specific hypercalls are allocated within the "Vendor
+Specific Hypervisor Service Call" range with a UID of
+``28b46fb6-2ec5-11e9-a9ca-4b564d003a74``. This UID should be queried by the
+guest using the standard "Call UID" function for the service range in
+order to determine that the KVM/arm64-specific hypercalls are available.
+
+``ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID``
+---------------------------------------------
+
+Provides a discovery mechanism for other KVM/arm64 hypercalls.
+
++---------------------+-------------------------------------------------------------+
+| Presence: | Mandatory for the KVM/arm64 UID |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC32 |
++---------------------+----------+--------------------------------------------------+
+| Function ID: | (uint32) | 0x86000000 |
++---------------------+----------+--------------------------------------------------+
+| Arguments: | None |
++---------------------+----------+----+---------------------------------------------+
+| Return Values: | (uint32) | R0 | Bitmap of available function numbers 0-31 |
+| +----------+----+---------------------------------------------+
+| | (uint32) | R1 | Bitmap of available function numbers 32-63 |
+| +----------+----+---------------------------------------------+
+| | (uint32) | R2 | Bitmap of available function numbers 64-95 |
+| +----------+----+---------------------------------------------+
+| | (uint32) | R3 | Bitmap of available function numbers 96-127 |
++---------------------+----------+----+---------------------------------------------+
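A guest could decode the four returned bitmaps roughly as follows. This is a minimal sketch, not KVM code: the helper name ``kvm_hyp_func_available`` is hypothetical, and it assumes that bit N of the concatenated R0-R3 bitmap corresponds to function number N, as the table above suggests.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: regs[0..3] hold R0-R3 as returned by
 * ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID. Assumes bit N of the
 * concatenated 128-bit bitmap stands for function number N.
 */
static bool kvm_hyp_func_available(const uint32_t regs[4], unsigned int func)
{
	if (regs == NULL || func >= 128)
		return false;	/* only function numbers 0-127 are described */

	/* Pick the 32-bit word, then the bit inside that word. */
	return (regs[func / 32] >> (func % 32)) & 1u;
}
```

With this layout, function number 1 would be tested against bit 1 of R0 and function number 32 against bit 0 of R1.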
+
+``ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID``
+----------------------------------------
+
+See ptp_kvm.rst
+
+``ARM_SMCCC_KVM_FUNC_HYP_MEMINFO``
+----------------------------------
+
+Query the memory protection parameters for a pKVM protected virtual machine.
+
++---------------------+-------------------------------------------------------------+
+| Presence: | Optional; pKVM protected guests only. |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC64 |
++---------------------+----------+--------------------------------------------------+
+| Function ID: | (uint32) | 0xC6000002 |
++---------------------+----------+----+---------------------------------------------+
+| Arguments: | (uint64) | R1 | Reserved / Must be zero |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R2 | Reserved / Must be zero |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R3 | Reserved / Must be zero |
++---------------------+----------+----+---------------------------------------------+
+| Return Values: | (int64) | R0 | ``INVALID_PARAMETER (-3)`` on error, else |
+| | | | memory protection granule in bytes |
++---------------------+----------+----+---------------------------------------------+
+
+``ARM_SMCCC_KVM_FUNC_MEM_SHARE``
+--------------------------------
+
+Share a region of memory with the KVM host, granting it read, write and execute
+permissions. The size of the region is equal to the memory protection granule
+advertised by ``ARM_SMCCC_KVM_FUNC_HYP_MEMINFO``.
+
++---------------------+-------------------------------------------------------------+
+| Presence: | Optional; pKVM protected guests only. |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC64 |
++---------------------+----------+--------------------------------------------------+
+| Function ID: | (uint32) | 0xC6000003 |
++---------------------+----------+----+---------------------------------------------+
+| Arguments: | (uint64) | R1 | Base IPA of memory region to share |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R2 | Reserved / Must be zero |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R3 | Reserved / Must be zero |
++---------------------+----------+----+---------------------------------------------+
+| Return Values: | (int64) | R0 | ``SUCCESS (0)`` |
+| | | +---------------------------------------------+
+| | | | ``INVALID_PARAMETER (-3)`` |
++---------------------+----------+----+---------------------------------------------+
+
+``ARM_SMCCC_KVM_FUNC_MEM_UNSHARE``
+----------------------------------
+
+Revoke access permission from the KVM host to a memory region previously shared
+with ``ARM_SMCCC_KVM_FUNC_MEM_SHARE``. The size of the region is equal to the
+memory protection granule advertised by ``ARM_SMCCC_KVM_FUNC_HYP_MEMINFO``.
+
++---------------------+-------------------------------------------------------------+
+| Presence: | Optional; pKVM protected guests only. |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC64 |
++---------------------+----------+--------------------------------------------------+
+| Function ID: | (uint32) | 0xC6000004 |
++---------------------+----------+----+---------------------------------------------+
+| Arguments: | (uint64) | R1 | Base IPA of memory region to unshare |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R2 | Reserved / Must be zero |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R3 | Reserved / Must be zero |
++---------------------+----------+----+---------------------------------------------+
+| Return Values: | (int64) | R0 | ``SUCCESS (0)`` |
+| | | +---------------------------------------------+
+| | | | ``INVALID_PARAMETER (-3)`` |
++---------------------+----------+----+---------------------------------------------+
+
+``ARM_SMCCC_KVM_FUNC_MMIO_GUARD``
+---------------------------------
+
+Request that a given memory region is handled as MMIO by the hypervisor,
+allowing accesses to this region to be emulated by the KVM host. The size of the
+region is equal to the memory protection granule advertised by
+``ARM_SMCCC_KVM_FUNC_HYP_MEMINFO``.
+
++---------------------+-------------------------------------------------------------+
+| Presence: | Optional; pKVM protected guests only. |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC64 |
++---------------------+----------+--------------------------------------------------+
+| Function ID: | (uint32) | 0xC6000007 |
++---------------------+----------+----+---------------------------------------------+
+| Arguments: | (uint64) | R1 | Base IPA of MMIO memory region |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R2 | Reserved / Must be zero |
+| +----------+----+---------------------------------------------+
+| | (uint64) | R3 | Reserved / Must be zero |
++---------------------+----------+----+---------------------------------------------+
+| Return Values: | (int64) | R0 | ``SUCCESS (0)`` |
+| | | +---------------------------------------------+
+| | | | ``INVALID_PARAMETER (-3)`` |
++---------------------+----------+----+---------------------------------------------+
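Because MEM_SHARE, MEM_UNSHARE and MMIO_GUARD each act on a single region of one protection granule, a guest that wants to cover a larger range has to issue one call per granule. The sketch below illustrates that loop under stated assumptions: the callback type, ``kvm_for_each_granule`` and ``fake_mem_share`` are hypothetical names, the real HVC is replaced by a stand-in so the code runs outside a guest, and the requirement that base and size be granule-aligned is an assumption of this sketch rather than something the tables above state.

```c
#include <stdint.h>

#define SMCCC_INVALID_PARAMETER	(-3)	/* value taken from the tables above */

/* Hypothetical hook standing in for one HVC64 call with R1 = ipa. */
typedef int64_t (*granule_call_fn)(uint64_t ipa);

/*
 * Apply 'call' to every granule in [base, base + size).
 * 'granule' is the value returned by ARM_SMCCC_KVM_FUNC_HYP_MEMINFO.
 * Assumption of this sketch: base and size must be granule-aligned.
 */
static int64_t kvm_for_each_granule(uint64_t base, uint64_t size,
				    uint64_t granule, granule_call_fn call)
{
	if (granule == 0 || base % granule || size % granule)
		return SMCCC_INVALID_PARAMETER;

	for (uint64_t off = 0; off < size; off += granule) {
		int64_t ret = call(base + off);

		if (ret != 0)
			return ret;	/* stop at the first failing granule */
	}
	return 0;
}

/* Stand-in for the real hypercall: only counts invocations. */
static uint64_t calls_seen;

static int64_t fake_mem_share(uint64_t ipa)
{
	(void)ipa;
	calls_seen++;
	return 0;
}
```

A real guest would first query the granule with ``ARM_SMCCC_KVM_FUNC_HYP_MEMINFO`` and pass a callback that issues the actual HVC. On failure part of the range may already have been processed, so a caller would need to decide whether to undo those granules.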
diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index 7f231c724e16..ec09881de4cf 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -7,6 +7,7 @@ ARM
.. toctree::
:maxdepth: 2
+ fw-pseudo-registers
hyp-abi
hypercalls
pvtime
diff --git a/Documentation/virt/kvm/arm/ptp_kvm.rst b/Documentation/virt/kvm/arm/ptp_kvm.rst
index aecdc80ddcd8..7c0960970a0e 100644
--- a/Documentation/virt/kvm/arm/ptp_kvm.rst
+++ b/Documentation/virt/kvm/arm/ptp_kvm.rst
@@ -7,19 +7,29 @@ PTP_KVM is used for high precision time sync between host and guests.
It relies on transferring the wall clock and counter value from the
host to the guest using a KVM-specific hypercall.
-* ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: 0x86000001
+``ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID``
+----------------------------------------
-This hypercall uses the SMC32/HVC32 calling convention:
+Retrieve current time information for the specific counter. There are no
+endianness restrictions.
-ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID
- ============== ======== =====================================
- Function ID: (uint32) 0x86000001
- Arguments: (uint32) KVM_PTP_VIRT_COUNTER(0)
- KVM_PTP_PHYS_COUNTER(1)
- Return Values: (int32) NOT_SUPPORTED(-1) on error, or
- (uint32) Upper 32 bits of wall clock time (r0)
- (uint32) Lower 32 bits of wall clock time (r1)
- (uint32) Upper 32 bits of counter (r2)
- (uint32) Lower 32 bits of counter (r3)
- Endianness: No Restrictions.
- ============== ======== =====================================
++---------------------+-------------------------------------------------------+
+| Presence: | Optional |
++---------------------+-------------------------------------------------------+
+| Calling convention: | HVC32 |
++---------------------+----------+--------------------------------------------+
+| Function ID: | (uint32) | 0x86000001 |
++---------------------+----------+----+---------------------------------------+
+| Arguments: | (uint32) | R1 | ``KVM_PTP_VIRT_COUNTER (0)`` |
+| | | +---------------------------------------+
+| | | | ``KVM_PTP_PHYS_COUNTER (1)`` |
++---------------------+----------+----+---------------------------------------+
+| Return Values: | (int32) | R0 | ``NOT_SUPPORTED (-1)`` on error, else |
+| | | | upper 32 bits of wall clock time |
+| +----------+----+---------------------------------------+
+| | (uint32) | R1 | Lower 32 bits of wall clock time |
+| +----------+----+---------------------------------------+
+| | (uint32) | R2 | Upper 32 bits of counter |
+| +----------+----+---------------------------------------+
+| | (uint32) | R3 | Lower 32 bits of counter |
++---------------------+----------+----+---------------------------------------+
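The four 32-bit return registers can be recombined into two 64-bit values as sketched below. This is an illustration only: ``struct kvm_ptp_sample`` and ``kvm_ptp_decode`` are hypothetical names, and the error check simply follows the table above by treating an R0 value equal to ``NOT_SUPPORTED (-1)`` as failure.

```c
#include <stdbool.h>
#include <stdint.h>

#define SMCCC_NOT_SUPPORTED	(-1)	/* value taken from the table above */

/* Hypothetical container for the decoded result. */
struct kvm_ptp_sample {
	uint64_t wall_clock;	/* R0 (upper) : R1 (lower) */
	uint64_t counter;	/* R2 (upper) : R3 (lower) */
};

/* r[0..3] hold R0-R3 as returned by the hypercall. */
static bool kvm_ptp_decode(const uint32_t r[4], struct kvm_ptp_sample *out)
{
	/* R0 doubles as the error indicator. */
	if (r[0] == (uint32_t)SMCCC_NOT_SUPPORTED)
		return false;

	out->wall_clock = ((uint64_t)r[0] << 32) | r[1];
	out->counter = ((uint64_t)r[2] << 32) | r[3];
	return true;
}
```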
diff --git a/Documentation/virt/kvm/devices/arm-vgic.rst b/Documentation/virt/kvm/devices/arm-vgic.rst
index 40bdeea1d86e..19f0c6756891 100644
--- a/Documentation/virt/kvm/devices/arm-vgic.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic.rst
@@ -31,7 +31,7 @@ Groups:
KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
Base address in the guest physical address space of the GIC virtual cpu
interface register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
- This address needs to be 4K aligned and the region covers 4 KByte.
+ This address needs to be 4K aligned and the region covers 8 KByte.
Errors:
diff --git a/Documentation/virt/kvm/devices/s390_flic.rst b/Documentation/virt/kvm/devices/s390_flic.rst
index ea96559ba501..b784f8016748 100644
--- a/Documentation/virt/kvm/devices/s390_flic.rst
+++ b/Documentation/virt/kvm/devices/s390_flic.rst
@@ -58,11 +58,15 @@ Groups:
Enables async page faults for the guest. So in case of a major page fault
the host is allowed to handle this async and continues the guest.
+ -EINVAL is returned when called on the FLIC of a ucontrol VM.
+
KVM_DEV_FLIC_APF_DISABLE_WAIT
Disables async page faults for the guest and waits until already pending
async page faults are done. This is necessary to trigger a completion interrupt
for every init interrupt before migrating the interrupt list.
+ -EINVAL is returned when called on the FLIC of a ucontrol VM.
+
KVM_DEV_FLIC_ADAPTER_REGISTER
Register an I/O adapter interrupt source. Takes a kvm_s390_io_adapter
describing the adapter to register::
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..31a9576c07af 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -142,8 +142,8 @@ the cpu field to the processor id.
:Architectures: ARM64
-2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER
------------------------------------------------------------------------------
+2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_{VTIMER,PTIMER,HVTIMER,HPTIMER}
+-----------------------------------------------------------------------
:Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
pointer to an int
@@ -159,10 +159,12 @@ A value describing the architected timer interrupt number when connected to an
in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the
attribute overrides the default values (see below).
-============================= ==========================================
-KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27)
-KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30)
-============================= ==========================================
+============================== ==========================================
+KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27)
+KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30)
+KVM_ARM_VCPU_TIMER_IRQ_HVTIMER The EL2 virtual timer intid (default: 28)
+KVM_ARM_VCPU_TIMER_IRQ_HPTIMER The EL2 physical timer intid (default: 26)
+============================== ==========================================
Setting the same PPI for different timers will prevent the VCPUs from running.
Setting the interrupt number on a VCPU configures all VCPUs created at that
diff --git a/Documentation/virt/kvm/halt-polling.rst b/Documentation/virt/kvm/halt-polling.rst
index c82a04b709b4..a6790a67e205 100644
--- a/Documentation/virt/kvm/halt-polling.rst
+++ b/Documentation/virt/kvm/halt-polling.rst
@@ -79,11 +79,11 @@ adjustment of the polling interval.
Module Parameters
=================
-The kvm module has 3 tuneable module parameters to adjust the global max
-polling interval as well as the rate at which the polling interval is grown and
-shrunk. These variables are defined in include/linux/kvm_host.h and as module
-parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the
-powerpc kvm-hv case.
+The kvm module has 4 tunable module parameters to adjust the global max polling
+interval, the initial value (to grow from 0), and the rate at which the polling
+interval is grown and shrunk. These variables are defined in
+include/linux/kvm_host.h and as module parameters in virt/kvm/kvm_main.c, or
+arch/powerpc/kvm/book3s_hv.c in the powerpc kvm-hv case.
+-----------------------+---------------------------+-------------------------+
|Module Parameter | Description | Default Value |
@@ -105,7 +105,7 @@ powerpc kvm-hv case.
| | grow_halt_poll_ns() | |
| | function. | |
+-----------------------+---------------------------+-------------------------+
-|halt_poll_ns_shrink | The value by which the | 0 |
+|halt_poll_ns_shrink | The value by which the | 2 |
| | halt polling interval is | |
| | divided in the | |
| | shrink_halt_poll_ns() | |
diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index ad13ec55ddfe..9ca5a45c2140 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -14,6 +14,7 @@ KVM
s390/index
ppc-pv
x86/index
+ loongarch/index
locking
vcpu-requests
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 02880d5552d5..c56d5f26c750 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -11,6 +11,8 @@ The acquisition orders for mutexes are as follows:
- cpus_read_lock() is taken outside kvm_lock
+- kvm_usage_lock is taken outside cpus_read_lock()
+
- kvm->lock is taken outside vcpu->mutex
- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
@@ -24,6 +26,13 @@ The acquisition orders for mutexes are as follows:
are taken on the waiting side when modifying memslots, so MMU notifiers
must not take either kvm->slots_lock or kvm->slots_arch_lock.
+cpus_read_lock() vs kvm_lock:
+
+- Taking cpus_read_lock() outside of kvm_lock is problematic, despite that
+ being the official ordering, as it is quite easy to unknowingly trigger
+ cpus_read_lock() while holding kvm_lock. Use caution when walking vm_list,
+ e.g. avoid complex operations when possible.
+
For SRCU:
- ``synchronize_srcu(&kvm->srcu)`` is called inside critical sections
@@ -126,8 +135,8 @@ We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, we disabled fast page fault for simplicity.
-A solution for indirect sp could be to pin the gfn, for example via
-kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:
+A solution for indirect sp could be to pin the gfn before the cmpxchg. After
+the pinning:
- We have held the refcount of pfn; that means the pfn can not be freed and
be reused for another gfn.
@@ -138,49 +147,51 @@ Then, we can ensure the dirty bitmaps is correctly set for a gfn.
2) Dirty bit tracking
-In the origin code, the spte can be fast updated (non-atomically) if the
+In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set since the
Accessed bit and Dirty bit can not be lost.
But it is not true after fast page fault since the spte can be marked
writable between reading spte and updating spte. Like below case:
-+------------------------------------------------------------------------+
-| At the beginning:: |
-| |
-| spte.W = 0 |
-| spte.Accessed = 1 |
-+------------------------------------+-----------------------------------+
-| CPU 0: | CPU 1: |
-+------------------------------------+-----------------------------------+
-| In mmu_spte_clear_track_bits():: | |
-| | |
-| old_spte = *spte; | |
-| | |
-| | |
-| /* 'if' condition is satisfied. */| |
-| if (old_spte.Accessed == 1 && | |
-| old_spte.W == 0) | |
-| spte = 0ull; | |
-+------------------------------------+-----------------------------------+
-| | on fast page fault path:: |
-| | |
-| | spte.W = 1 |
-| | |
-| | memory write on the spte:: |
-| | |
-| | spte.Dirty = 1 |
-+------------------------------------+-----------------------------------+
-| :: | |
-| | |
-| else | |
-| old_spte = xchg(spte, 0ull) | |
-| if (old_spte.Accessed == 1) | |
-| kvm_set_pfn_accessed(spte.pfn);| |
-| if (old_spte.Dirty == 1) | |
-| kvm_set_pfn_dirty(spte.pfn); | |
-| OOPS!!! | |
-+------------------------------------+-----------------------------------+
++-------------------------------------------------------------------------+
+| At the beginning:: |
+| |
+| spte.W = 0 |
+| spte.Accessed = 1 |
++-------------------------------------+-----------------------------------+
+| CPU 0: | CPU 1: |
++-------------------------------------+-----------------------------------+
+| In mmu_spte_update():: | |
+| | |
+| old_spte = *spte; | |
+| | |
+| | |
+| /* 'if' condition is satisfied. */ | |
+| if (old_spte.Accessed == 1 && | |
+| old_spte.W == 0) | |
+| spte = new_spte; | |
++-------------------------------------+-----------------------------------+
+| | on fast page fault path:: |
+| | |
+| | spte.W = 1 |
+| | |
+| | memory write on the spte:: |
+| | |
+| | spte.Dirty = 1 |
++-------------------------------------+-----------------------------------+
+| :: | |
+| | |
+| else | |
+| old_spte = xchg(spte, new_spte);| |
+| if (old_spte.Accessed && | |
+| !new_spte.Accessed) | |
+| flush = true; | |
+| if (old_spte.Dirty && | |
+| !new_spte.Dirty) | |
+| flush = true; | |
+| OOPS!!! | |
++-------------------------------------+-----------------------------------+
The Dirty bit is lost in this case.
@@ -227,10 +238,16 @@ time it will be set using the Dirty tracking mechanism described above.
:Type: mutex
:Arch: any
:Protects: - vm_list
- - kvm_usage_count
+
+``kvm_usage_lock``
+^^^^^^^^^^^^^^^^^^
+
+:Type: mutex
+:Arch: any
+:Protects: - kvm_usage_count
- hardware virtualization enable/disable
-:Comment: KVM also disables CPU hotplug via cpus_read_lock() during
- enable/disable.
+:Comment: Exists to allow taking cpus_read_lock() while kvm_usage_count is
+ protected, which simplifies the virtualization enabling logic.
``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -290,11 +307,12 @@ time it will be set using the Dirty tracking mechanism described above.
wakeup.
``vendor_module_lock``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^
:Type: mutex
:Arch: x86
:Protects: loading a vendor module (kvm_amd or kvm_intel)
-:Comment: Exists because using kvm_lock leads to deadlock. cpu_hotplug_lock is
- taken outside of kvm_lock, e.g. in KVM's CPU online/offline callbacks, and
- many operations need to take cpu_hotplug_lock when loading a vendor module,
- e.g. updating static calls.
+:Comment: Exists because using kvm_lock leads to deadlock. kvm_lock is taken
+ in notifiers, e.g. __kvmclock_cpufreq_notifier(), that may be invoked while
+ cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
+ operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
+ updating static calls.
diff --git a/Documentation/virt/kvm/loongarch/hypercalls.rst b/Documentation/virt/kvm/loongarch/hypercalls.rst
new file mode 100644
index 000000000000..2d6b94031f1b
--- /dev/null
+++ b/Documentation/virt/kvm/loongarch/hypercalls.rst
@@ -0,0 +1,89 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+The LoongArch paravirtual interface
+===================================
+
+KVM hypercalls use the HVCL instruction with code 0x100 and the hypercall
+number is put in a0. Up to five arguments may be placed in registers a1 - a5.
+The return value is placed in v0 (an alias of a0).
+
+Source code for this interface can be found in arch/loongarch/kvm*.
+
+Querying for existence
+======================
+
+To determine if the guest is running on KVM, it can use the cpucfg()
+function at index CPUCFG_KVM_BASE (0x40000000).
+
+The CPUCFG_KVM_BASE range, spanning from 0x40000000 to 0x400000FF, is marked
+as reserved. Consequently, all current and future processors will not
+implement any feature within this range.
+
+On a KVM-virtualized Linux system, a read operation on cpucfg() at index
+CPUCFG_KVM_BASE (0x40000000) returns the magic string 'KVM\0'.
+
+Once you have determined that the guest is running on a
+paravirtualization-capable KVM, you may use hypercalls as described below.
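The magic string check described above might look like the sketch below. It only covers the comparison step: ``cpucfg_word_is_kvm`` is a hypothetical name, obtaining the word with cpucfg() is left to the caller, and the sketch assumes the 32-bit value is compared byte-wise, in memory order, against the four bytes of 'KVM\0'.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CPUCFG_KVM_BASE	0x40000000u	/* index quoted in the text above */

/*
 * Hypothetical helper: 'word' is the value read via cpucfg() at index
 * CPUCFG_KVM_BASE. Assumes a byte-wise comparison in memory order.
 */
static bool cpucfg_word_is_kvm(uint32_t word)
{
	return memcmp(&word, "KVM\0", 4) == 0;
}
```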
+
+KVM hypercall ABI
+=================
+
+The KVM hypercall ABI is simple, with one scratch register a0 (v0) and at most
+five generic registers (a1 - a5) used as input parameters. The FP (Floating-
+point) and vector registers are not utilized as input registers and must
+remain unmodified during a hypercall.
+
+Hypercall functions can be inlined as they only use one scratch register.
+
+The parameters are as follows:
+
+ ======== ================= ================
+ Register IN OUT
+ ======== ================= ================
+ a0 function number Return code
+ a1 1st parameter -
+ a2 2nd parameter -
+ a3 3rd parameter -
+ a4 4th parameter -
+ a5 5th parameter -
+ ======== ================= ================
+
+The return codes may be one of the following:
+
+ ==== =========================
+ Code Meaning
+ ==== =========================
+ 0 Success
+ -1 Hypercall not implemented
+ -2 Bad Hypercall parameter
+ ==== =========================
+
+KVM Hypercalls Documentation
+============================
+
+The template for each hypercall is as follows:
+
+1. Hypercall name
+2. Purpose
+
+1. KVM_HCALL_FUNC_IPI
+---------------------
+
+:Purpose: Send IPIs to multiple vCPUs.
+
+- a0: KVM_HCALL_FUNC_IPI
+- a1: Lower part of the bitmap for destination physical CPUIDs
+- a2: Higher part of the bitmap for destination physical CPUIDs
+- a3: The lowest physical CPUID in the bitmap
+
+The hypercall lets a guest send multiple IPIs (Inter-Processor Interrupts) with
+at most 128 destinations per hypercall. The destinations are represented in a
+bitmap contained in the first two input registers (a1 and a2).
+
+Bit 0 of a1 corresponds to the physical CPUID in the third input register (a3)
+and bit 1 corresponds to the physical CPUID in a3+1, and so on.
+
+PV IPI on LoongArch includes both PV IPI multicast sending and PV IPI receiving,
+and SWI is used for PV IPI injection since accessing SWI registers does not
+cause VM-exits.
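The bitmap encoding for KVM_HCALL_FUNC_IPI can be sketched as follows. This is an illustration, not the kernel implementation: ``struct pv_ipi_args`` and ``pv_ipi_build`` are hypothetical names, and the sketch assumes 64-bit registers, so that a1 and a2 together hold the 128 destination bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three hypercall arguments. */
struct pv_ipi_args {
	uint64_t a1;	/* bitmap bits 0-63 */
	uint64_t a2;	/* bitmap bits 64-127 */
	uint64_t a3;	/* lowest physical CPUID in the bitmap */
};

/*
 * Encode 'n' destination CPUIDs. Returns false if the list is empty or
 * spans more than 128 consecutive CPUIDs, in which case the caller would
 * have to split the request into several hypercalls.
 */
static bool pv_ipi_build(const uint32_t *cpuids, size_t n,
			 struct pv_ipi_args *out)
{
	uint64_t lo = 0, hi = 0;
	uint32_t min;

	if (n == 0)
		return false;

	/* Bit 0 of a1 corresponds to the lowest CPUID, passed in a3. */
	min = cpuids[0];
	for (size_t i = 1; i < n; i++)
		if (cpuids[i] < min)
			min = cpuids[i];

	for (size_t i = 0; i < n; i++) {
		uint32_t bit = cpuids[i] - min;

		if (bit >= 128)
			return false;
		if (bit < 64)
			lo |= 1ULL << bit;
		else
			hi |= 1ULL << (bit - 64);
	}

	out->a1 = lo;
	out->a2 = hi;
	out->a3 = min;
	return true;
}
```

For the CPUIDs 4, 5 and 68, this yields a3 = 4, bits 0 and 1 set in a1, and bit 0 set in a2.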
diff --git a/Documentation/virt/kvm/loongarch/index.rst b/Documentation/virt/kvm/loongarch/index.rst
new file mode 100644
index 000000000000..83387b4c5345
--- /dev/null
+++ b/Documentation/virt/kvm/loongarch/index.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+KVM for LoongArch systems
+=========================
+
+.. toctree::
+ :maxdepth: 2
+
+ hypercalls.rst
diff --git a/Documentation/virt/kvm/s390/s390-diag.rst b/Documentation/virt/kvm/s390/s390-diag.rst
index ca85f030eb0b..3e4f9e3bef81 100644
--- a/Documentation/virt/kvm/s390/s390-diag.rst
+++ b/Documentation/virt/kvm/s390/s390-diag.rst
@@ -35,20 +35,24 @@ DIAGNOSE function codes not specific to KVM, please refer to the
documentation for the s390 hypervisors defining them.
-DIAGNOSE function code 'X'500' - KVM virtio functions
------------------------------------------------------
+DIAGNOSE function code 'X'500' - KVM functions
+----------------------------------------------
-If the function code specifies 0x500, various virtio-related functions
-are performed.
+If the function code specifies 0x500, various KVM-specific functions
+are performed, including virtio functions.
-General register 1 contains the virtio subfunction code. Supported
-virtio subfunctions depend on KVM's userspace. Generally, userspace
-provides either s390-virtio (subcodes 0-2) or virtio-ccw (subcode 3).
+General register 1 contains the subfunction code. Supported subfunctions
+depend on KVM's userspace. Regarding virtio subfunctions, generally
+userspace provides either s390-virtio (subcodes 0-2) or virtio-ccw
+(subcode 3).
Upon completion of the DIAGNOSE instruction, general register 2 contains
the function's return code, which is either a return code or a subcode
specific value.
+If the specified subfunction is not supported, a SPECIFICATION exception
+will be triggered.
+
Subcode 0 - s390-virtio notification and early console printk
Handled by userspace.
@@ -76,6 +80,23 @@ Subcode 3 - virtio-ccw notification
See also the virtio standard for a discussion of this hypercall.
+Subcode 4 - storage-limit
+ Handled by userspace.
+
+ After completion of the DIAGNOSE call, general register 2 will
+ contain the storage limit: the maximum physical address that might be
+ used for storage throughout the lifetime of the VM.
+
+ The storage limit does not indicate currently usable storage; it may
+ include holes, standby storage and areas reserved for other purposes, such
+ as memory hotplug or virtio-mem devices. Other interfaces for detecting
+ actually usable storage, such as SCLP, must be used in conjunction with
+ this subfunction.
+
+ Note that the storage limit can be larger, but never smaller than the
+ maximum storage address indicated by SCLP via the "maximum storage
+ increment" and the "increment size".
+
DIAGNOSE function code 'X'501 - KVM breakpoint
----------------------------------------------
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 84335d119ff1..1ddb6a86ce7f 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -76,15 +76,56 @@ are defined in ``<linux/psp-dev.h>``.
KVM implements the following commands to support common lifecycle events of SEV
guests, such as launching, running, snapshotting, migrating and decommissioning.
-1. KVM_SEV_INIT
----------------
+1. KVM_SEV_INIT2
+----------------
-The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform
+The KVM_SEV_INIT2 command is used by the hypervisor to initialize the SEV platform
context. In a typical workflow, this command should be the first command issued.
+For this command to be accepted, either KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM
+must have been passed to the KVM_CREATE_VM ioctl. A virtual machine created
+with those machine types in turn cannot be run until KVM_SEV_INIT2 is invoked.
+
+Parameters: struct kvm_sev_init (in)
Returns: 0 on success, -negative on error
+::
+
+ struct kvm_sev_init {
+ __u64 vmsa_features; /* initial value of features field in VMSA */
+ __u32 flags; /* must be 0 */
+ __u16 ghcb_version; /* maximum guest GHCB version allowed */
+ __u16 pad1;
+ __u32 pad2[8];
+ };
+
+It is an error if any bit that is set in ``flags`` or ``vmsa_features`` is
+not supported by the hypervisor. ``vmsa_features`` must be
+0 for SEV virtual machines, as they do not have a VMSA.
+
+``ghcb_version`` must be 0 for SEV virtual machines, as they do not issue GHCB
+requests. If ``ghcb_version`` is 0 for any other guest type, then the maximum
+allowed guest GHCB protocol will default to version 2.
+
+This command replaces the deprecated KVM_SEV_INIT and KVM_SEV_ES_INIT commands.
+The commands did not have any parameters (the ``data`` field was unused) and
+only work for the KVM_X86_DEFAULT_VM machine type (0).
+
+They behave as if:
+
+* the VM type is KVM_X86_SEV_VM for KVM_SEV_INIT, or KVM_X86_SEV_ES_VM for
+ KVM_SEV_ES_INIT
+
+* the ``flags`` and ``vmsa_features`` fields of ``struct kvm_sev_init`` are
+ set to zero, and ``ghcb_version`` is set to 0 for KVM_SEV_INIT and 1 for
+ KVM_SEV_ES_INIT.
+
+If the ``KVM_X86_SEV_VMSA_FEATURES`` attribute does not exist, the hypervisor only
+supports KVM_SEV_INIT and KVM_SEV_ES_INIT. In that case, note that KVM_SEV_ES_INIT
+might set the debug swap VMSA feature (bit 5) depending on the value of the
+``debug_swap`` parameter of ``kvm-amd.ko``.
+
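As a rough illustration of the KVM_SEV_INIT2 rules above, the following C sketch fills in the argument structure and rejects the combinations the text describes as errors. The structure is a local mirror of ``struct kvm_sev_init`` (real code should take the definition from ``<linux/kvm.h>``). ``sev_init2_prepare()`` and its ``supported_features`` argument are hypothetical names; the supported mask is assumed to have been read from the ``KVM_X86_SEV_VMSA_FEATURES`` attribute.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_sev_init as documented above (illustration only). */
struct sev_init_args {
	uint64_t vmsa_features;	/* initial value of features field in VMSA */
	uint32_t flags;		/* must be 0 */
	uint16_t ghcb_version;	/* maximum guest GHCB version allowed */
	uint16_t pad1;
	uint32_t pad2[8];
};

/*
 * Hypothetical helper: validate the inputs against the documented rules and
 * fill *out. Returns 0 on success or -EINVAL if the kernel would be expected
 * to reject the request.
 */
static int sev_init2_prepare(struct sev_init_args *out, bool is_sev_es,
			     uint64_t vmsa_features,
			     uint64_t supported_features,
			     uint16_t ghcb_version)
{
	/* Plain SEV guests have no VMSA and issue no GHCB requests. */
	if (!is_sev_es && (vmsa_features != 0 || ghcb_version != 0))
		return -EINVAL;

	/* Every requested feature bit must be supported by the hypervisor. */
	if (vmsa_features & ~supported_features)
		return -EINVAL;

	memset(out, 0, sizeof(*out));	/* flags and padding stay zero */
	out->vmsa_features = vmsa_features;
	out->ghcb_version = ghcb_version;
	return 0;
}
```

The filled structure would then be handed to the kernel through the ``data`` field of ``struct kvm_sev_cmd`` with ``id`` set to KVM_SEV_INIT2, using the KVM_MEMORY_ENCRYPT_OP ioctl on the VM file descriptor.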
2. KVM_SEV_LAUNCH_START
-----------------------
@@ -425,6 +466,124 @@ issued by the hypervisor to make the guest ready for execution.
Returns: 0 on success, -negative on error
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. It must be called prior to issuing
+KVM_SEV_SNP_LAUNCH_UPDATE or KVM_SEV_SNP_LAUNCH_FINISH.
+
+Parameters (in): struct kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_start {
+ __u64 policy; /* Guest policy to use. */
+ __u8 gosvw[16]; /* Guest OS visible workarounds. */
+ __u16 flags; /* Must be zero. */
+ __u8 pad0[6];
+ __u64 pad1[4];
+ };
+
+See SNP_LAUNCH_START in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_start``.
+
+19. KVM_SEV_SNP_LAUNCH_UPDATE
+-----------------------------
+
+The KVM_SEV_SNP_LAUNCH_UPDATE command is used for loading userspace-provided
+data into a guest GPA range, measuring the contents into the SNP guest context
+created by KVM_SEV_SNP_LAUNCH_START, and then encrypting/validating that GPA
+range so that it is immediately readable using the encryption key
+associated with the guest context once the guest is booted. The guest can
+then attest the measurement associated with its context before unlocking any
+secrets.
+
+It is required that the GPA ranges initialized by this command have had the
+KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
+for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
+
+Upon success, this command is not guaranteed to have processed the entire
+range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
+``struct kvm_sev_snp_launch_update`` will be updated to correspond to the
+remaining range that has yet to be processed. The caller should continue
+calling this command until those fields indicate the entire range has been
+processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
+range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
+buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
+``uaddr`` will be ignored completely.
+
+Parameters (in): struct kvm_sev_snp_launch_update
+
+Returns: 0 on success, < 0 on error, -EAGAIN if caller should retry
+
+::
+
+ struct kvm_sev_snp_launch_update {
+ __u64 gfn_start; /* Guest page number to load/encrypt data into. */
+ __u64 uaddr; /* Userspace address of data to be loaded/encrypted. */
+ __u64 len; /* 4k-aligned length in bytes to copy into guest memory. */
+ __u8 type; /* The type of the guest pages being initialized. */
+ __u8 pad0;
+ __u16 flags; /* Must be zero. */
+ __u32 pad1;
+ __u64 pad2[4];
+
+ };
+
+where the allowed values for ``type`` are #define'd as::
+
+ KVM_SEV_SNP_PAGE_TYPE_NORMAL
+ KVM_SEV_SNP_PAGE_TYPE_ZERO
+ KVM_SEV_SNP_PAGE_TYPE_UNMEASURED
+ KVM_SEV_SNP_PAGE_TYPE_SECRETS
+ KVM_SEV_SNP_PAGE_TYPE_CPUID
+
+See the SEV-SNP spec [snp-fw-abi]_ for further details on how each page type is
+used/measured.
+
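The partial-completion behaviour of KVM_SEV_SNP_LAUNCH_UPDATE described above implies a caller-side loop. The C sketch below shows one way such a loop might look, under several assumptions: the structure is a local mirror of ``struct kvm_sev_snp_launch_update``, the ``issue`` callback is a hypothetical stand-in for the real ioctl wrapper (expected to return 0 or a negative errno value and to leave the updated fields in ``args``), and ``mock_update()`` only simulates a kernel that handles at most two pages per call.

```c
#include <errno.h>
#include <stdint.h>

#define SNP_PAGE_SIZE 4096ULL

/* Local mirror of struct kvm_sev_snp_launch_update (illustration only). */
struct snp_launch_update_args {
	uint64_t gfn_start;
	uint64_t uaddr;
	uint64_t len;
	uint8_t type;
	uint8_t pad0;
	uint16_t flags;
	uint32_t pad1;
	uint64_t pad2[4];
};

/* Hypothetical callback type standing in for the real ioctl wrapper. */
typedef int (*snp_update_fn)(struct snp_launch_update_args *args, void *opaque);

/* Keep issuing the command until the remaining length reaches zero. */
static int snp_launch_update_all(struct snp_launch_update_args *args,
				 snp_update_fn issue, void *opaque)
{
	while (args->len != 0) {
		int ret = issue(args, opaque);

		if (ret == -EAGAIN)
			continue;	/* caller should retry */
		if (ret < 0)
			return ret;	/* hard error */
		/* On success the fields already describe the remaining range. */
	}
	return 0;
}

/* Mock for demonstration: fails once with -EAGAIN, then does 2 pages per call. */
static int mock_update(struct snp_launch_update_args *args, void *opaque)
{
	int *calls = opaque;
	uint64_t chunk = args->len < 2 * SNP_PAGE_SIZE ? args->len
							 : 2 * SNP_PAGE_SIZE;

	if (++*calls == 1)
		return -EAGAIN;

	args->gfn_start += chunk / SNP_PAGE_SIZE;
	args->uaddr += chunk;
	args->len -= chunk;
	return 0;
}
```

A production loop would probably also want to bound the number of retries, since this sketch spins for as long as the callback keeps returning ``-EAGAIN``.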
+20. KVM_SEV_SNP_LAUNCH_FINISH
+-----------------------------
+
+After completion of the SNP guest launch flow, the KVM_SEV_SNP_LAUNCH_FINISH
+command can be issued to make the guest ready for execution.
+
+Parameters (in): struct kvm_sev_snp_launch_finish
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 vcek_disabled;
+ __u8 host_data[32];
+ __u8 pad0[3];
+ __u16 flags; /* Must be zero */
+ __u64 pad1[4];
+ };
+
+
+See SNP_LAUNCH_FINISH in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_finish``.
+
+Device attribute API
+====================
+
+Attributes of the SEV implementation can be retrieved through the
+``KVM_HAS_DEVICE_ATTR`` and ``KVM_GET_DEVICE_ATTR`` ioctls on the ``/dev/kvm``
+device node, using group ``KVM_X86_GRP_SEV``.
+
+Currently only one attribute is implemented:
+
+* ``KVM_X86_SEV_VMSA_FEATURES``: returns the set of all bits that
+ are accepted in the ``vmsa_features`` field of ``KVM_SEV_INIT2``.
+
Firmware Management
===================
@@ -444,9 +603,11 @@ References
==========
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
.. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
.. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst
index 49a05f24747b..37c79362a48f 100644
--- a/Documentation/virt/kvm/x86/errata.rst
+++ b/Documentation/virt/kvm/x86/errata.rst
@@ -33,6 +33,18 @@ Note however that any software (e.g ``WIN87EM.DLL``) expecting these features
to be present likely predates these CPUID feature bits, and therefore
doesn't know to check for them anyway.
+``KVM_SET_VCPU_EVENTS`` issue
+-----------------------------
+
+Invalid KVM_SET_VCPU_EVENTS input with respect to error codes *may* result in
+failed VM-Entry on Intel CPUs. Pre-CET Intel CPUs require that exception
+injection through the VMCS correctly set the "error code valid" flag, e.g.
+require the flag be set when injecting a #GP, clear when injecting a #UD,
+clear when injecting a soft exception, etc. Intel CPUs that enumerate
+IA32_VMX_BASIC[56] as '1' relax VMX's consistency checks, and AMD CPUs have no
+restrictions whatsoever. KVM_SET_VCPU_EVENTS doesn't sanity check the vector
+versus "has_error_code", i.e. KVM's ABI follows AMD behavior.
+
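Because KVM_SET_VCPU_EVENTS does not perform this check itself, userspace that wants to avoid failed VM-Entry on older Intel CPUs could validate its own input first. The C sketch below is one possible check, assuming a protected-mode guest and the classic set of hardware exceptions that deliver an error code (#DF, #TS, #NP, #SS, #GP, #PF, #AC). Both function names are hypothetical, and newer vectors such as #CP are deliberately left out because they do not exist on pre-CET CPUs.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the hardware exception vector delivers an error code. */
static bool hw_exception_has_error_code(uint8_t vector)
{
	switch (vector) {
	case 8:		/* #DF */
	case 10:	/* #TS */
	case 11:	/* #NP */
	case 12:	/* #SS */
	case 13:	/* #GP */
	case 14:	/* #PF */
	case 17:	/* #AC */
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical pre-flight check: the has_error_code value handed to
 * KVM_SET_VCPU_EVENTS should match what the vector architecturally implies.
 */
static bool vcpu_event_error_code_ok(uint8_t vector, bool has_error_code)
{
	return has_error_code == hw_exception_has_error_code(vector);
}
```

For example, injecting #GP (vector 13) with ``has_error_code`` set passes the check, while injecting #UD (vector 6) with ``has_error_code`` set does not.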
Nested virtualization features
------------------------------
@@ -48,3 +60,21 @@ have the same physical APIC ID, KVM will deliver events targeting that APIC ID
only to the vCPU with the lowest vCPU ID. If KVM_X2APIC_API_USE_32BIT_IDS is
not enabled, KVM follows x86 architecture when processing interrupts (all vCPUs
matching the target APIC ID receive the interrupt).
+
+MTRRs
+-----
+KVM does not virtualize guest MTRR memory types. KVM emulates accesses to MTRR
+MSRs, i.e. {RD,WR}MSR in the guest will behave as expected, but KVM does not
+honor guest MTRRs when determining the effective memory type, and instead
+treats all of guest memory as having Writeback (WB) MTRRs.
+
+CR0.CD
+------
+KVM does not virtualize CR0.CD on Intel CPUs. Similar to MTRR MSRs, KVM
+emulates CR0.CD accesses so that loads and stores from/to CR0 behave as
+expected, but setting CR0.CD=1 has no impact on the cacheability of guest
+memory.
+
+Note, this erratum does not affect AMD CPUs, which fully virtualize CR0.CD in
+hardware, i.e. put the CPU caches into "no fill" mode when CR0.CD=1, even when
+running in the guest.
\ No newline at end of file
diff --git a/Documentation/virt/uml/user_mode_linux_howto_v2.rst b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
index d1cfe415e4c4..584000b743f3 100644
--- a/Documentation/virt/uml/user_mode_linux_howto_v2.rst
+++ b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
@@ -217,14 +217,14 @@ remote UML and other VM instances.
+-----------+--------+------------------------------------+------------+
| fd | vector | dependent on fd type | varies |
+-----------+--------+------------------------------------+------------+
+| vde | vector | dep. on VDE VPN: Virt.Net Locator | varies |
++-----------+--------+------------------------------------+------------+
| tuntap | legacy | none | ~ 500Mbit |
+-----------+--------+------------------------------------+------------+
| daemon | legacy | none | ~ 450Mbit |
+-----------+--------+------------------------------------+------------+
| socket | legacy | none | ~ 450Mbit |
+-----------+--------+------------------------------------+------------+
-| pcap | legacy | rx only | ~ 450Mbit |
-+-----------+--------+------------------------------------+------------+
| ethertap | legacy | obsolete | ~ 500Mbit |
+-----------+--------+------------------------------------+------------+
| vde | legacy | obsolete | ~ 500Mbit |
@@ -575,6 +575,41 @@ https://github.com/NetSys/bess/wiki/Built-In-Modules-and-Ports
BESS transport does not require any special privileges.
+VDE vector transport
+--------------------
+
+Virtual Distributed Ethernet (VDE) is a project whose main goal is to provide
+highly flexible support for virtual networking.
+
+http://wiki.virtualsquare.org/#/tutorials/vdebasics
+
+Common usages of VDE include fast prototyping and teaching.
+
+Examples:
+
+ ``vecX:transport=vde,vnl=tap://tap0``
+
+use tap0
+
+ ``vecX:transport=vde,vnl=slirp://``
+
+use slirp
+
+ ``vec0:transport=vde,vnl=vde:///tmp/switch``
+
+connect to a vde switch
+
+ ``vecX:transport=\"vde,vnl=cmd://ssh remote.host //tmp/sshlirp\"``
+
+connect to a remote slirp (an instant VPN that tunnels over ssh using sshlirp)
+https://github.com/virtualsquare/sshlirp
+
+ ``vec0:transport=vde,vnl=vxvde://234.0.0.1``
+
+connect to a local area cloud (all the UML nodes using the same
+multicast address running on hosts in the same multicast domain (LAN)
+will be automagically connected together to a virtual LAN).
+
Configuring Legacy transports
=============================