Diffstat (limited to 'Documentation/virt/hyperv')
-rw-r--r-- | Documentation/virt/hyperv/clocks.rst      |  21
-rw-r--r-- | Documentation/virt/hyperv/coco.rst        | 260
-rw-r--r-- | Documentation/virt/hyperv/hibernation.rst | 336
-rw-r--r-- | Documentation/virt/hyperv/index.rst       |   2
-rw-r--r-- | Documentation/virt/hyperv/overview.rst    |  22
-rw-r--r-- | Documentation/virt/hyperv/vmbus.rst       | 143
6 files changed, 707 insertions, 77 deletions
diff --git a/Documentation/virt/hyperv/clocks.rst b/Documentation/virt/hyperv/clocks.rst index a56f4837d443..176043265803 100644 --- a/Documentation/virt/hyperv/clocks.rst +++ b/Documentation/virt/hyperv/clocks.rst @@ -62,12 +62,21 @@ shared page with scale and offset values into user space. User space code performs the same algorithm of reading the TSC and applying the scale and offset to get the constant 10 MHz clock. -Linux clockevents are based on Hyper-V synthetic timer 0. While -Hyper-V offers 4 synthetic timers for each CPU, Linux only uses -timer 0. Interrupts from stimer0 are recorded on the "HVS" line in -/proc/interrupts. Clockevents based on the virtualized PIT and -local APIC timer also work, but the Hyper-V synthetic timer is -preferred. +Linux clockevents are based on Hyper-V synthetic timer 0 (stimer0). +While Hyper-V offers 4 synthetic timers for each CPU, Linux only uses +timer 0. In older versions of Hyper-V, an interrupt from stimer0 +results in a VMBus control message that is demultiplexed by +vmbus_isr() as described in the Documentation/virt/hyperv/vmbus.rst +documentation. In newer versions of Hyper-V, stimer0 interrupts can +be mapped to an architectural interrupt, which is referred to as +"Direct Mode". Linux prefers to use Direct Mode when available. Since +x86/x64 doesn't support per-CPU interrupts, Direct Mode statically +allocates an x86 interrupt vector (HYPERV_STIMER0_VECTOR) across all CPUs +and explicitly codes it to call the stimer0 interrupt handler. Hence +interrupts from stimer0 are recorded on the "HVS" line in /proc/interrupts +rather than being associated with a Linux IRQ. Clockevents based on the +virtualized PIT and local APIC timer also work, but Hyper-V stimer0 +is preferred. The driver for the Hyper-V synthetic system clock and timers is drivers/clocksource/hyperv_timer.c. diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst new file mode 100644 index 000000000000..c15d6fe34b4e --- /dev/null +++ b/Documentation/virt/hyperv/coco.rst @@ -0,0 +1,260 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Confidential Computing VMs +========================== +Hyper-V can create and run Linux guests that are Confidential Computing +(CoCo) VMs. Such VMs cooperate with the physical processor to better protect +the confidentiality and integrity of data in the VM's memory, even in the +face of a hypervisor/VMM that has been compromised and may behave maliciously. +CoCo VMs on Hyper-V share the generic CoCo VM threat model and security +objectives described in Documentation/security/snp-tdx-threat-model.rst. Note +that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or +"isolation VMs". + +A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the +following: + +* Physical hardware with a processor that supports CoCo VMs + +* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs + +* The VM runs a version of Linux that supports being a CoCo VM + +The physical hardware requirements are as follows: + +* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME, + SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo + VM on Hyper-V. + +* Intel processor with TDX + +To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V +when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM, +or vice versa, after it is created. + +Operational Modes +----------------- +Hyper-V CoCo VMs can run in two modes. 
The mode is selected when the VM is +created and cannot be changed during the life of the VM. + +* Fully-enlightened mode. In this mode, the guest operating system is + enlightened to understand and manage all aspects of running as a CoCo VM. + +* Paravisor mode. In this mode, a paravisor layer between the guest and the + host provides some operations needed to run as a CoCo VM. The guest operating + system can have fewer CoCo enlightenments than is required in the + fully-enlightened case. + +Conceptually, fully-enlightened mode and paravisor mode may be treated as +points on a spectrum spanning the degree of guest enlightenment needed to run +as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full +implementation of paravisor mode is the other end of the spectrum, where all +aspects of running as a CoCo VM are handled by the paravisor, and a normal +guest OS with no knowledge of memory encryption or other aspects of CoCo VMs +can run successfully. However, the Hyper-V implementation of paravisor mode +does not go this far, and is somewhere in the middle of the spectrum. Some +aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS +must be enlightened for other aspects. Unfortunately, there is no +standardized enumeration of feature/functions that might be provided in the +paravisor, and there is no standardized mechanism for a guest OS to query the +paravisor for the feature/functions it provides. The understanding of what +the paravisor provides is hard-coded in the guest OS. + +Paravisor mode has similarities to the `Coconut project`_, which aims to provide +a limited paravisor to provide services to the guest such as a virtual TPM. +However, the Hyper-V paravisor generally handles more aspects of CoCo VMs +than is currently envisioned for Coconut, and so is further toward the "no +guest enlightenments required" end of the spectrum. + +.. _Coconut project: https://github.com/coconut-svsm/svsm + +In the CoCo VM threat model, the paravisor is in the guest security domain +and must be trusted by the guest OS. By implication, the hypervisor/VMM must +protect itself against a potentially malicious paravisor just like it +protects against a potentially malicious guest. + +The hardware architectural approach to fully-enlightened vs. paravisor mode +varies depending on the underlying processor. + +* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in + VMPL 0 and has full control of the guest context. In paravisor mode, the + guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor + running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have. + Certain operations require the guest to invoke the paravisor. Furthermore, in + paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode + as defined by the SEV-SNP architecture. This mode simplifies guest management + of memory encryption when a paravisor is used. + +* With Intel TDX processor, in fully-enlightened mode the guest OS runs in an + L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the + L1 VM, and the guest OS runs in a nested L2 VM. + +Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This +MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and +whether a paravisor is being used. It is straightforward to build a single +kernel image that can boot and run properly on either architecture, and in +either mode. 
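As an illustration of how a single kernel image might branch on the reported mode, the following sketch decodes a CoCo-mode value into a hardware type and a paravisor flag. The bit layout, field names, and helpers here are invented for the example and are not the real synthetic MSR definition; the actual detection lives in the Hyper-V initialization code (see ms_hyperv_init_platform())::

  /*
   * Illustrative only: decode a CoCo-mode value into the hardware
   * type and a paravisor flag, as described above. The bit layout
   * and names are hypothetical, not the real synthetic MSR format.
   */
  #include <linux/bits.h>
  #include <linux/types.h>

  enum example_coco_hw { EXAMPLE_COCO_NONE, EXAMPLE_COCO_SNP, EXAMPLE_COCO_TDX };

  struct example_coco_mode {
          enum example_coco_hw hw;
          bool paravisor;
  };

  static struct example_coco_mode example_decode_coco_mode(u64 msr_val)
  {
          struct example_coco_mode m = { .hw = EXAMPLE_COCO_NONE };

          switch (msr_val & 0xf) {                /* hypothetical "type" field */
          case 1:
                  m.hw = EXAMPLE_COCO_SNP;
                  break;
          case 2:
                  m.hw = EXAMPLE_COCO_TDX;
                  break;
          }
          m.paravisor = !!(msr_val & BIT(4));     /* hypothetical flag bit */
          return m;
  }

With the mode decoded once at boot, SEV-SNP-, TDX-, and paravisor-specific behavior can be confined to the few places where it matters.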
+ +Paravisor Effects +----------------- +Running in paravisor mode affects the following areas of generic Linux kernel +CoCo VM functionality: + +* Initial guest memory setup. When a new VM is created in paravisor mode, the + paravisor runs first and sets up the guest physical memory as encrypted. The + guest Linux does normal memory initialization, except for explicitly marking + appropriate ranges as decrypted (shared). In paravisor mode, Linux does not + perform the early boot memory setup steps that are particularly tricky with + AMD SEV-SNP in fully-enlightened mode. + +* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest + CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM, + respectively, and not the guest Linux. Consequently, these exception handlers + do not run in the guest Linux and are not a required enlightenment for a + Linux guest in paravisor mode. + +* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the + guest indicating that the VM is operating with the respective hardware + support. While these CPUID flags are visible in fully-enlightened CoCo VMs, + the paravisor filters out these flags and the guest Linux does not see them. + Throughout the Linux kernel, explicitly testing these flags has mostly been + eliminated in favor of the cc_platform_has() function, with the goal of + abstracting the differences between SEV-SNP and TDX. But the + cc_platform_has() abstraction also allows the Hyper-V paravisor configuration + to selectively enable aspects of CoCo VM functionality even when the CPUID + flags are not set. The exception is early boot memory setup on SEV-SNP, which + tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor + mode VM achieves the desired effect of not running SEV-SNP specific early + boot memory setup. + +* Device emulation. In paravisor mode, the Hyper-V paravisor provides + emulation of devices such as the IO-APIC and TPM. Because the emulation + happens in the paravisor in the guest context (instead of the hypervisor/VMM + context), MMIO accesses to these devices must be encrypted references instead + of the decrypted references that would be used in a fully-enlightened CoCo + VM. The __ioremap_caller() function has been enhanced to make a callback to + check whether a particular address range should be treated as encrypted + (private). See the "is_private_mmio" callback. + +* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest + memory between encrypted and decrypted requires coordinating with the + hypervisor/VMM. This is done via callbacks invoked from + __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and + TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V + specific set of callbacks is used. These callbacks invoke the paravisor so + that the paravisor can coordinate the transitions and inform the hypervisor + as necessary. See hv_vtom_init() where these callbacks are set up. + +* Interrupt injection. In fully-enlightened mode, a malicious hypervisor + could inject interrupts into the guest OS at times that violate x86/x64 + architectural rules. For full protection, the guest OS should include + enlightenments that use the interrupt injection management features provided + by CoCo-capable processors. In paravisor mode, the paravisor mediates + interrupt injection into the guest OS, and ensures that the guest OS only + sees interrupts that are "legal".
The paravisor uses the interrupt injection + management features provided by the CoCo-capable physical processor, thereby + masking these complexities from the guest OS. + +Hyper-V Hypercalls +------------------ +When in fully-enlightened mode, hypercalls made by the Linux guest are routed +directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode, +normal hypercalls trap to the paravisor first, which may in turn invoke the +hypervisor. But the paravisor is idiosyncratic in this regard, and a few +hypercalls made by the Linux guest must always be routed directly to the +hypervisor. These hypercall sites test for a paravisor being present, and use +a special invocation sequence. See hv_post_message(), for example. + +Guest communication with Hyper-V +-------------------------------- +Separate from the generic Linux kernel handling of memory encryption in Linux +CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory +shared between the Linux guest and the host. This shared memory must be +marked decrypted to enable communication. Furthermore, since the threat model +includes a compromised and potentially malicious host, the guest must guard +against leaking any unintended data to the host through this shared memory. + +These Hyper-V and VMBus memory pages are marked as decrypted: + +* VMBus monitor pages + +* Synthetic interrupt controller (synic) related pages (unless supplied by + the paravisor) + +* Per-cpu hypercall input and output pages (unless running with a paravisor) + +* VMBus ring buffers. The direct mapping is marked decrypted in + __vmbus_establish_gpadl(). The secondary mapping created in + hv_ringbuffer_init() must also include the "decrypted" attribute. + +When the guest writes data to memory that is shared with the host, it must +ensure that only the intended data is written. Padding or unused fields must +be initialized to zeros before copying into the shared memory so that random +kernel data is not inadvertently given to the host. + +Similarly, when the guest reads memory that is shared with the host, it must +validate the data before acting on it so that a malicious host cannot induce +the guest to expose unintended data. Doing such validation can be tricky +because the host can modify the shared memory areas even while or after +validation is performed. For messages passed from the host to the guest in a +VMBus ring buffer, the length of the message is validated, and the message is +copied into a temporary (encrypted) buffer for further validation and +processing. The copying adds a small amount of overhead, but is the only way +to protect against a malicious host. See hv_pkt_iter_first(). + +Many drivers for VMBus devices have been "hardened" by adding code to fully +validate messages received over VMBus, instead of assuming that Hyper-V is +acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the +vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a +CoCo VM have not been hardened, and they are not allowed to load in a CoCo +VM. See vmbus_is_valid_offer() where such devices are excluded. + +Two VMBus devices depend on the Hyper-V host to do DMA data transfers: +storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal +Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb +memory is done implicitly. netvsc has two modes for data transfers. 
The first +mode goes through send and receive buffer space that is explicitly allocated +by the netvsc driver, and is used for most smaller packets. These send and +receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because +the netvsc driver explicitly copies packets to/from these buffers, the +equivalent of bounce buffering between encrypted and decrypted memory is +already part of the data path. The second mode uses the normal Linux kernel +DMA APIs, and is bounce buffered through swiotlb memory implicitly like in +storvsc. + +Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM. +Linux PCI device drivers access PCI config space using standard APIs provided +by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO +space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory +encryption prevents Hyper-V from reading the guest instruction stream to +emulate the access. So in a CoCo VM, these functions must make a hypercall +with arguments explicitly describing the access. See +_hv_pcifront_read_config() and _hv_pcifront_write_config() and the +"use_calls" flag indicating to use hypercalls. + +load_unaligned_zeropad() +------------------------ +When transitioning memory between encrypted and decrypted, the caller of +set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring +the memory isn't in use and isn't referenced while the transition is in +progress. The transition has multiple steps, and includes interaction with +the Hyper-V host. The memory is in an inconsistent state until all steps are +complete. A reference while the state is inconsistent could result in an +exception that can't be cleanly fixed up. + +However, the kernel load_unaligned_zeropad() mechanism may make stray +references that can't be prevented by the caller of set_memory_encrypted() or +set_memory_decrypted(), so there's specific code in the #VC or #VE exception +handler to fixup this case. But a CoCo VM running on Hyper-V may be +configured to run with a paravisor, with the #VC or #VE exception routed to +the paravisor. There's no architectural way to forward the exceptions back to +the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code +in the #VC/#VE handlers doesn't run. + +To avoid this problem, the Hyper-V specific functions for notifying the +hypervisor of the transition mark pages as "not present" while a transition +is in progress. If load_unaligned_zeropad() causes a stray reference, a +normal page fault is generated instead of #VC or #VE, and the page-fault- +based handlers for load_unaligned_zeropad() fixup the reference. When the +encrypted/decrypted transition is complete, the pages are marked as "present" +again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility(). diff --git a/Documentation/virt/hyperv/hibernation.rst b/Documentation/virt/hyperv/hibernation.rst new file mode 100644 index 000000000000..4ff27f4a317a --- /dev/null +++ b/Documentation/virt/hyperv/hibernation.rst @@ -0,0 +1,336 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Hibernating Guest VMs +===================== + +Background +---------- +Linux supports the ability to hibernate itself in order to save power. +Hibernation is sometimes called suspend-to-disk, as it writes a memory +image to disk and puts the hardware into the lowest possible power +state. Upon resume from hibernation, the hardware is restarted and the +memory image is restored from disk so that it can resume execution +where it left off. 
See the "Hibernation" section of +Documentation/admin-guide/pm/sleep-states.rst. + +Hibernation is usually done on devices with a single user, such as a +personal laptop. For example, the laptop goes into hibernation when +the cover is closed, and resumes when the cover is opened again. +Hibernation and resume happen on the same hardware, and Linux kernel +code orchestrating the hibernation steps assumes that the hardware +configuration is not changed while in the hibernated state. + +Hibernation can be initiated within Linux by writing "disk" to +/sys/power/state or by invoking the reboot system call with the +appropriate arguments. This functionality may be wrapped by user space +commands such "systemctl hibernate" that are run directly from a +command line or in response to events such as the laptop lid closing. + +Considerations for Guest VM Hibernation +--------------------------------------- +Linux guests on Hyper-V can also be hibernated, in which case the +hardware is the virtual hardware provided by Hyper-V to the guest VM. +Only the targeted guest VM is hibernated, while other guest VMs and +the underlying Hyper-V host continue to run normally. While the +underlying Windows Hyper-V and physical hardware on which it is +running might also be hibernated using hibernation functionality in +the Windows host, host hibernation and its impact on guest VMs is not +in scope for this documentation. + +Resuming a hibernated guest VM can be more challenging than with +physical hardware because VMs make it very easy to change the hardware +configuration between the hibernation and resume. Even when the resume +is done on the same VM that hibernated, the memory size might be +changed, or virtual NICs or SCSI controllers might be added or +removed. Virtual PCI devices assigned to the VM might be added or +removed. Most such changes cause the resume steps to fail, though +adding a new virtual NIC, SCSI controller, or vPCI device should work. + +Additional complexity can ensue because the disks of the hibernated VM +can be moved to another newly created VM that otherwise has the same +virtual hardware configuration. While it is desirable for resume from +hibernation to succeed after such a move, there are challenges. See +details on this scenario and its limitations in the "Resuming on a +Different VM" section below. + +Hyper-V also provides ways to move a VM from one Hyper-V host to +another. Hyper-V tries to ensure processor model and Hyper-V version +compatibility using VM Configuration Versions, and prevents moves to +a host that isn't compatible. Linux adapts to host and processor +differences by detecting them at boot time, but such detection is not +done when resuming execution in the hibernation image. If a VM is +hibernated on one host, then resumed on a host with a different processor +model or Hyper-V version, settings recorded in the hibernation image +may not match the new host. Because Linux does not detect such +mismatches when resuming the hibernation image, undefined behavior +and failures could result. + + +Enabling Guest VM Hibernation +----------------------------- +Hibernation of a Hyper-V guest VM is disabled by default because +hibernation is incompatible with memory hot-add, as provided by the +Hyper-V balloon driver. If hot-add is used and the VM hibernates, it +hibernates with more memory than it started with. But when the VM +resumes from hibernation, Hyper-V gives the VM only the originally +assigned memory, and the memory size mismatch causes resume to fail. 
+ +To enable a Hyper-V VM for hibernation, the Hyper-V administrator must +enable the ACPI virtual S4 sleep state in the ACPI configuration that +Hyper-V provides to the guest VM. Such enablement is accomplished by +modifying a WMI property of the VM, the steps for which are outside +the scope of this documentation but are available on the web. +Enablement is treated as the indicator that the administrator +prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V +balloon driver in Linux disables hot-add. Enablement is indicated if +the contents of /sys/power/disk contains "platform" as an option. The +enablement is also visible in /sys/bus/vmbus/hibernation. See function +hv_is_hibernation_supported(). + +Linux supports ACPI sleep states on x86, but not on arm64. So Linux +guest VM hibernation is not available on Hyper-V for arm64. + +Initiating Guest VM Hibernation +------------------------------- +Guest VMs can self-initiate hibernation using the standard Linux +methods of writing "disk" to /sys/power/state or the reboot system +call. As an additional layer, Linux guests on Hyper-V support the +"Shutdown" integration service, via which a Hyper-V administrator can +tell a Linux VM to hibernate using a command outside the VM. The +command generates a request to the Hyper-V shutdown driver in Linux, +which sends the uevent "EVENT=hibernate". See kernel functions +shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule +must be provided in the VM that handles this event and initiates +hibernation. + +Handling VMBus Devices During Hibernation & Resume +-------------------------------------------------- +The VMBus bus driver, and the individual VMBus device drivers, +implement suspend and resume functions that are called as part of the +Linux orchestration of hibernation and of resuming from hibernation. +The overall approach is to leave in place the data structures for the +primary VMBus channels and their associated Linux devices, such as +SCSI controllers and others, so that they are captured in the +hibernation image. This approach allows any state associated with the +device to be persisted across the hibernation/resume. When the VM +resumes, the devices are re-offered by Hyper-V and are connected to +the data structures that already exist in the resumed hibernation +image. + +VMBus devices are identified by class and instance GUID. (See section +"VMBus device creation/deletion" in +Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation, +the resume functions expect that the devices offered by Hyper-V have +the same class/instance GUIDs as the devices present at the time of +hibernation. Having the same class/instance GUIDs allows the offered +devices to be matched to the primary VMBus channel data structures in +the memory of the now resumed hibernation image. If any devices are +offered that don't match primary VMBus channel data structures that +already exist, they are processed normally as newly added devices. If +primary VMBus channels that exist in the resumed hibernation image are +not matched with a device offered in the resumed VM, the resume +sequence waits for 10 seconds, then proceeds. But the unmatched device +is likely to cause errors in the resumed VM. + +When resuming existing primary VMBus channels, the newly offered +relids might be different because relids can change on each VM boot, +even if the VM configuration hasn't changed. 
The VMBus bus driver +resume function matches the class/instance GUIDs, and updates the +relids in case they have changed. + +VMBus sub-channels are not persisted in the hibernation image. Each +VMBus device driver's suspend function must close any sub-channels +prior to hibernation. Closing a sub-channel causes Hyper-V to send a +RESCIND_CHANNELOFFER message, which Linux processes by freeing the +channel data structures so that all vestiges of the sub-channel are +removed. By contrast, primary channels are marked closed and their +ring buffers are freed, but Hyper-V does not send a rescind message, +so the channel data structure continues to exist. Upon resume, the +device driver's resume function re-allocates the ring buffer and +re-opens the existing channel. It then communicates with Hyper-V to +re-open sub-channels from scratch. + +The Linux ends of Hyper-V sockets are forced closed at the time of +hibernation. The guest can't force closing the host end of the socket, +but any host-side actions on the host end will produce an error. + +VMBus devices use the same suspend function for the "freeze" and the +"poweroff" phases, and the same resume function for the "thaw" and +"restore" phases. See the "Entering Hibernation" section of +Documentation/driver-api/pm/devices.rst for the sequencing of the +phases. + +Detailed Hibernation Sequence +----------------------------- +1. The Linux power management (PM) subsystem prepares for + hibernation by freezing user space processes and allocating + memory to hold the hibernation image. +2. As part of the "freeze" phase, Linux PM calls the "suspend" + function for each VMBus device in turn. As described above, this + function removes sub-channels, and leaves the primary channel in + a closed state. +3. Linux PM calls the "suspend" function for the VMBus bus, which + closes any Hyper-V socket channels and unloads the top-level + VMBus connection with the Hyper-V host. +4. Linux PM disables non-boot CPUs, creates the hibernation image in + the previously allocated memory, then re-enables non-boot CPUs. + The hibernation image contains the memory data structures for the + closed primary channels, but no sub-channels. +5. As part of the "thaw" phase, Linux PM calls the "resume" function + for the VMBus bus, which re-establishes the top-level VMBus + connection and requests that Hyper-V re-offer the VMBus devices. + As offers are received for the primary channels, the relids are + updated as previously described. +6. Linux PM calls the "resume" function for each VMBus device. Each + device re-opens its primary channel, and communicates with Hyper-V + to re-establish sub-channels if appropriate. The sub-channels + are re-created as new channels since they were previously removed + entirely in Step 2. +7. With VMBus devices now working again, Linux PM writes the + hibernation image from memory to disk. +8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff" + phase. VMBus channels are closed and the top-level VMBus + connection is unloaded. +9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state + S4. Hibernation is now complete. + +Detailed Resume Sequence +------------------------ +1. The guest VM boots into a fresh Linux OS instance. During boot, + the top-level VMBus connection is established, and synthetic + devices are enabled. This happens via the normal paths that don't + involve hibernation. +2. Linux PM hibernation code reads swap space to find and read + the hibernation image into memory.
If there is no hibernation + image, then this boot becomes a normal boot. +3. If this is a resume from hibernation, the "freeze" phase is used + to shutdown VMBus devices and unload the top-level VMBus + connection in the running fresh OS instance, just like Steps 2 + and 3 in the hibernation sequence. +4. Linux PM disables non-boot CPUs, and transfers control to the + read-in hibernation image. In the now-running hibernation image, + non-boot CPUs are restarted. +5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6 + from the hibernation sequence. The top-level VMBus connection is + re-established, and offers are received and matched to primary + channels in the image. Relids are updated. VMBus device resume + functions re-open primary channels and re-create sub-channels. +6. Linux PM exits the hibernation resume sequence and the VM is now + running normally from the hibernation image. + +Key-Value Pair (KVP) Pseudo-Device Anomalies +-------------------------------------------- +The VMBus KVP device behaves differently from other pseudo-devices +offered by Hyper-V. When the KVP primary channel is closed, Hyper-V +sends a rescind message, which causes all vestiges of the device to be +removed. But Hyper-V then re-offers the device, causing it to be newly +re-created. The removal and re-creation occurs during the "freeze" +phase of hibernation, so the hibernation image contains the re-created +KVP device. Similar behavior occurs during the "freeze" phase of the +resume sequence while still in the fresh OS instance. But in both +cases, the top-level VMBus connection is subsequently unloaded, which +causes the device to be discarded on the Hyper-V side. So no harm is +done and everything still works. + +Virtual PCI devices +------------------- +Virtual PCI devices are physical PCI devices that are mapped directly +into the VM's physical address space so the VM can interact directly +with the hardware. vPCI devices include those accessed via what Hyper-V +calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC +Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst. + +Hyper-V DDA devices are offered to guest VMs after the top-level VMBus +connection is established, just like VMBus synthetic devices. They are +statically assigned to the VM, and their instance GUIDs don't change +unless the Hyper-V administrator makes changes to the configuration. +DDA devices are represented in Linux as virtual PCI devices that have +a VMBus identity as well as a PCI identity. Consequently, Linux guest +hibernation first handles DDA devices as VMBus devices in order to +manage the VMBus channel. But then they are also handled as PCI +devices using the hibernation functions implemented by their native +PCI driver. + +SR-IOV NIC VFs also have a VMBus identity as well as a PCI +identity, and overall are processed similarly to DDA devices. A +difference is that VFs are not offered to the VM during initial boot +of the VM. Instead, the VMBus synthetic NIC driver first starts +operating and communicates to Hyper-V that it is prepared to accept a +VF, and then the VF offer is made. However, the VMBus connection +might later be unloaded and then re-established without the VM being +rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation +Sequence above and in the Detailed Resume Sequence. 
In such a case, +the VFs likely became part of the VM during initial boot, so when the +VMBus connection is re-established, the VFs are offered on the +re-established connection without intervention by the synthetic NIC driver. + +UIO Devices +----------- +A VMBus device can be exposed to user space using the Hyper-V UIO +driver (uio_hv_generic.c) so that a user space driver can control and +operate the device. However, the VMBus UIO driver does not support the +suspend and resume operations needed for hibernation. If a VMBus +device is configured to use the UIO driver, hibernating the VM fails +and Linux continues to run normally. The most common use of the Hyper-V +UIO driver is for DPDK networking, but there are other uses as well. + +Resuming on a Different VM +-------------------------- +This scenario occurs in the Azure public cloud in that a hibernated +customer VM only exists as saved configuration and disks -- the VM no +longer exists on any Hyper-V host. When the customer VM is resumed, a +new Hyper-V VM with identical configuration is created, likely on a +different Hyper-V host. That new Hyper-V VM becomes the resumed +customer VM, and the steps the Linux kernel takes to resume from the +hibernation image must work in that new VM. + +While the disks and their contents are preserved from the original VM, +the Hyper-V-provided VMBus instance GUIDs of the disk controllers and +other synthetic devices would typically be different. The difference +would cause the resume from hibernation to fail, so several things are +done to solve this problem: + +* For VMBus synthetic devices that support only a single instance, + Hyper-V always assigns the same instance GUIDs. For example, the + Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo + device, etc., always have the same instance GUID, both for local + Hyper-V installs as well as in the Azure cloud. + +* VMBus synthetic SCSI controllers may have multiple instances in a + VM, and in the general case instance GUIDs vary from VM to VM. + However, Azure VMs always have exactly two synthetic SCSI + controllers, and Azure code overrides the normal Hyper-V behavior + so these controllers are always assigned the same two instance + GUIDs. Consequently, when a customer VM is resumed on a newly + created VM, the instance GUIDs match. But this guarantee does not + hold for local Hyper-V installs. + +* Similarly, VMBus synthetic NICs may have multiple instances in a + VM, and the instance GUIDs vary from VM to VM. Again, Azure code + overrides the normal Hyper-V behavior so that the instance GUID + of a synthetic NIC in a customer VM does not change, even if the + customer VM is deallocated or hibernated, and then re-constituted + on a newly created VM. As with SCSI controllers, this behavior + does not hold for local Hyper-V installs. + +* vPCI devices do not have the same instance GUIDs when resuming + from hibernation on a newly created VM. Consequently, Azure does + not support hibernation for VMs that have DDA devices such as + NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the + VF from the VM before it hibernates so that the hibernation image + does not contain a VF device. When the VM is resumed it + instantiates a new VF, rather than trying to match against a VF + that is present in the hibernation image. 
Because Azure must + remove any VFs before initiating hibernation, Azure VM + hibernation must be initiated externally from the Azure Portal or + Azure CLI, which in turn uses the Shutdown integration service to + tell Linux to do the hibernation. If hibernation is self-initiated + within the Azure VM, VFs remain in the hibernation image, and are + not resumed properly. + +In summary, Azure takes special actions to remove VFs and to ensure +that VMBus device instance GUIDs match on a new/different VM, allowing +hibernation to work for most general-purpose Azure VMs sizes. While +similar special actions could be taken when resuming on a different VM +on a local Hyper-V install, orchestrating such actions is not provided +out-of-the-box by local Hyper-V and so requires custom scripting. diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst index de447e11b4a5..c84c40fd61c9 100644 --- a/Documentation/virt/hyperv/index.rst +++ b/Documentation/virt/hyperv/index.rst @@ -11,3 +11,5 @@ Hyper-V Enlightenments vmbus clocks vpci + hibernation + coco diff --git a/Documentation/virt/hyperv/overview.rst b/Documentation/virt/hyperv/overview.rst index cd493332c88a..77408a89d1a4 100644 --- a/Documentation/virt/hyperv/overview.rst +++ b/Documentation/virt/hyperv/overview.rst @@ -40,7 +40,7 @@ Linux guests communicate with Hyper-V in four different ways: arm64, these synthetic registers must be accessed using explicit hypercalls. -* VMbus: VMbus is a higher-level software construct that is built on +* VMBus: VMBus is a higher-level software construct that is built on the other 3 mechanisms. It is a message passing interface between the Hyper-V host and the Linux guest. It uses memory that is shared between Hyper-V and the guest, along with various signaling @@ -54,8 +54,8 @@ x86/x64 architecture only. .. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs -VMbus is not documented. This documentation provides a high-level -overview of VMbus and how it works, but the details can be discerned +VMBus is not documented. This documentation provides a high-level +overview of VMBus and how it works, but the details can be discerned only from the code. Sharing Memory @@ -74,7 +74,7 @@ follows: physical address space. How Hyper-V is told about the GPA or list of GPAs varies. In some cases, a single GPA is written to a synthetic register. In other cases, a GPA or list of GPAs is sent - in a VMbus message. + in a VMBus message. * Hyper-V translates the GPAs into "real" physical memory addresses, and creates a virtual mapping that it can use to access the memory. @@ -133,9 +133,9 @@ only the CPUs actually present in the VM, so Linux does not report any hot-add CPUs. A Linux guest CPU may be taken offline using the normal Linux -mechanisms, provided no VMbus channel interrupts are assigned to -the CPU. See the section on VMbus Interrupts for more details -on how VMbus channel interrupts can be re-assigned to permit +mechanisms, provided no VMBus channel interrupts are assigned to +the CPU. See the section on VMBus Interrupts for more details +on how VMBus channel interrupts can be re-assigned to permit taking a CPU offline. 32-bit and 64-bit @@ -169,14 +169,14 @@ and functionality. Hyper-V indicates feature/function availability via flags in synthetic MSRs that Hyper-V provides to the guest, and the guest code tests these flags. 
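As an example of the flag-testing pattern, guest code checks the feature word that the kernel caches at boot before relying on a Hyper-V facility. The cached field and the specific flag shown below are assumptions chosen only for illustration::

  /*
   * Illustrative flag test: the kernel caches Hyper-V feature flags
   * at boot (ms_hyperv.features on x86) and code tests individual
   * bits before using the corresponding facility. The flag chosen
   * here is only an example.
   */
  #include <asm/mshyperv.h>

  static bool have_reference_time_counter(void)
  {
          return !!(ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE);
  }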
-VMbus has its own protocol version that is negotiated during the -initial VMbus connection from the guest to Hyper-V. This version +VMBus has its own protocol version that is negotiated during the +initial VMBus connection from the guest to Hyper-V. This version number is also output to dmesg during boot. This version number is checked in a few places in the code to determine if specific functionality is present. -Furthermore, each synthetic device on VMbus also has a protocol -version that is separate from the VMbus protocol version. Device +Furthermore, each synthetic device on VMBus also has a protocol +version that is separate from the VMBus protocol version. Device drivers for these synthetic devices typically negotiate the device protocol version, and may test that protocol version to determine if specific device functionality is present. diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst index d2012d9022c5..1dcef6a7fda3 100644 --- a/Documentation/virt/hyperv/vmbus.rst +++ b/Documentation/virt/hyperv/vmbus.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -VMbus +VMBus ===== -VMbus is a software construct provided by Hyper-V to guest VMs. It +VMBus is a software construct provided by Hyper-V to guest VMs. It consists of a control path and common facilities used by synthetic devices that Hyper-V presents to guest VMs. The control path is used to offer synthetic devices to the guest VM and, in some cases, @@ -12,9 +12,9 @@ and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other. -VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus -entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c) -establishes the VMbus control path with the Hyper-V host, then +VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus +entry in a running Linux guest. The VMBus driver (drivers/hv/vmbus_drv.c) +establishes the VMBus control path with the Hyper-V host, then registers itself as a Linux bus driver. It implements the standard bus functions for adding and removing devices to/from the bus. @@ -49,9 +49,9 @@ synthetic NIC is referred to as "netvsc" and the Linux driver for the synthetic SCSI controller is "storvsc". These drivers contain functions with names like "storvsc_connect_to_vsp". -VMbus channels +VMBus channels -------------- -An instance of a synthetic device uses VMbus channels to communicate +An instance of a synthetic device uses VMBus channels to communicate between the VSP and the VSC. Channels are bi-directional and used for passing messages. Most synthetic devices use a single channel, but the synthetic SCSI controller and synthetic NIC may use multiple @@ -73,7 +73,7 @@ write indices and some control flags, followed by the memory for the actual ring. The size of the ring is determined by the VSC in the guest and is specific to each synthetic device. The list of GPAs making up the ring is communicated to the Hyper-V host over the -VMbus control path as a GPA Descriptor List (GPADL). See function +VMBus control path as a GPA Descriptor List (GPADL). See function vmbus_establish_gpadl(). Each ring buffer is mapped into contiguous Linux kernel virtual @@ -102,10 +102,10 @@ resources. For Windows Server 2019 and later, this limit is approximately 1280 Mbytes. For versions prior to Windows Server 2019, the limit is approximately 384 Mbytes. 
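In simplified form, the ring buffer layout described above looks like the following sketch. It is intentionally reduced; the real definition (struct hv_ring_buffer in include/linux/hyperv.h) has additional control fields and reserved space before the data area::

  /*
   * Simplified sketch of a VMBus ring buffer: a header holding the
   * read and write indices and control flags, followed by the data
   * area. Not the complete in-tree structure.
   */
  struct example_ring_buffer {
          u32 write_index;        /* offset of the next byte to be written */
          u32 read_index;         /* offset of the next byte to be read */
          u32 interrupt_mask;     /* set by the reader to suppress interrupts */
          /* ... additional control flags and reserved space ... */
          u8  buffer[];           /* the ring data itself */
  };

The read and write indices wrap around the data area, and the empty-to-non-empty transition of a ring is what triggers an interrupt to the reader, as described in the interrupts section below.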
-VMbus messages --------------- -All VMbus messages have a standard header that includes the message -length, the offset of the message payload, some flags, and a +VMBus channel messages +---------------------- +All messages sent in a VMBus channel have a standard header that includes +the message length, the offset of the message payload, some flags, and a transactionID. The portion of the message after the header is unique to each VSP/VSC pair. @@ -137,7 +137,7 @@ control message contains a list of GPAs that describe the data buffer. For example, the storvsc driver uses this approach to specify the data buffers to/from which disk I/O is done. -Three functions exist to send VMbus messages: +Three functions exist to send VMBus channel messages: 1. vmbus_sendpacket(): Control-only messages and messages with embedded data -- no GPAs @@ -154,20 +154,51 @@ Historically, Linux guests have trusted Hyper-V to send well-formed and valid messages, and Linux drivers for synthetic devices did not fully validate messages. With the introduction of processor technologies that fully encrypt guest memory and that allow the -guest to not trust the hypervisor (AMD SNP-SEV, Intel TDX), trusting +guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting the Hyper-V host is no longer a valid assumption. The drivers for -VMbus synthetic devices are being updated to fully validate any +VMBus synthetic devices are being updated to fully validate any values read from memory that is shared with Hyper-V, which includes -messages from VMbus devices. To facilitate such validation, +messages from VMBus devices. To facilitate such validation, messages read by the guest from the "in" ring buffer are copied to a temporary buffer that is not shared with Hyper-V. Validation is performed in this temporary buffer without the risk of Hyper-V maliciously modifying the message after it is validated but before it is used. -VMbus interrupts +Synthetic Interrupt Controller (synic) +-------------------------------------- +Hyper-V provides each guest CPU with a synthetic interrupt controller +that is used by VMBus for host-guest communication. While each synic +defines 16 synthetic interrupts (SINT), Linux uses only one of the 16 +(VMBUS_MESSAGE_SINT). All interrupts related to communication between +the Hyper-V host and a guest CPU use that SINT. + +The SINT is mapped to a single per-CPU architectural interrupt (i.e, +an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because +each CPU in the guest has a synic and may receive VMBus interrupts, +they are best modeled in Linux as per-CPU interrupts. This model works +well on arm64 where a single per-CPU Linux IRQ is allocated for +VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled +"Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86 +interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR) +across all CPUs and explicitly coded to call vmbus_isr(). In this case, +there's no Linux IRQ, and the interrupts are visible in aggregate in +/proc/interrupts on the "HYP" line. + +The synic provides the means to demultiplex the architectural interrupt into +one or more logical interrupts and route the logical interrupt to the proper +VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and +related functions that access synic data structures. + +The synic is not modeled in Linux as an irq chip or irq domain, +and the demultiplexed logical interrupts are not Linux IRQs. 
As such, +they don't appear in /proc/interrupts or /proc/irq. The CPU +affinity for one of these logical interrupts is controlled via an +entry under /sys/bus/vmbus as described below. + +VMBus interrupts ---------------- -VMbus provides a mechanism for the guest to interrupt the host when +VMBus provides a mechanism for the guest to interrupt the host when the guest has queued new messages in a ring buffer. The host expects that the guest will send an interrupt only when an "out" ring buffer transitions from empty to non-empty. If the guest sends @@ -176,63 +207,55 @@ unnecessary. If a guest sends an excessive number of unnecessary interrupts, the host may throttle that guest by suspending its execution for a few seconds to prevent a denial-of-service attack. -Similarly, the host will interrupt the guest when it sends a new -message on the VMbus control path, or when a VMbus channel "in" ring -buffer transitions from empty to non-empty. Each CPU in the guest -may receive VMbus interrupts, so they are best modeled as per-CPU -interrupts in Linux. This model works well on arm64 where a single -per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for -per-CPU IRQs, an x86 interrupt vector is statically allocated (see -HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to -call the VMbus interrupt service routine. These interrupts are -visible in /proc/interrupts on the "HYP" line. - -The guest CPU that a VMbus channel will interrupt is selected by the +Similarly, the host will interrupt the guest via the synic when +it sends a new message on the VMBus control path, or when a VMBus +channel "in" ring buffer transitions from empty to non-empty due to +the host inserting a new VMBus channel message. The control message stream +and each VMBus channel "in" ring buffer are separate logical interrupts +that are demultiplexed by vmbus_isr(). It demultiplexes by first checking +for channel interrupts by calling vmbus_chan_sched(), which looks at a synic +bitmap to determine which channels have pending interrupts on this CPU. +If multiple channels have pending interrupts for this CPU, they are +processed sequentially. When all channel interrupts have been processed, +vmbus_isr() checks for and processes any messages received on the VMBus +control path. + +The guest CPU that a VMBus channel will interrupt is selected by the guest when the channel is created, and the host is informed of that -selection. VMbus devices are broadly grouped into two categories: +selection. VMBus devices are broadly grouped into two categories: -1. "Slow" devices that need only one VMbus channel. The devices +1. "Slow" devices that need only one VMBus channel. The devices (such as keyboard, mouse, heartbeat, and timesync) generate - relatively few interrupts. Their VMbus channels are all + relatively few interrupts. Their VMBus channels are all assigned to interrupt the VMBUS_CONNECT_CPU, which is always CPU 0. -2. "High speed" devices that may use multiple VMbus channels for +2. "High speed" devices that may use multiple VMBus channels for higher parallelism and performance. These devices include the - synthetic SCSI controller and synthetic NIC. Their VMbus + synthetic SCSI controller and synthetic NIC. Their VMBus channels interrupts are assigned to CPUs that are spread out among the available CPUs in the VM so that interrupts on multiple channels can be processed in parallel. 
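The demultiplexing flow can be summarized with a self-contained sketch. It mirrors the structure of vmbus_isr() and vmbus_chan_sched() only conceptually; the names, types, and bitmap size are invented for the example::

  /*
   * Conceptual sketch of per-CPU interrupt demultiplexing: scan a
   * bitmap of pending channel interrupts, dispatch each channel,
   * then check for a pending control-path message. Names and types
   * are illustrative, not the real drivers/hv implementation.
   */
  #include <linux/bitmap.h>
  #include <linux/bitops.h>
  #include <linux/types.h>

  #define EXAMPLE_MAX_CHANNELS 2048

  struct example_synic_state {
          DECLARE_BITMAP(pending, EXAMPLE_MAX_CHANNELS); /* set by the synic */
          bool control_msg_pending;
  };

  static void example_dispatch_channel(unsigned int relid)
  {
          /* Look up the channel by relid and run or schedule its callback. */
  }

  static void example_handle_control_msg(void)
  {
          /* Process a message from the control path (cf. vmbus_on_msg_dpc()). */
  }

  static void example_vmbus_isr(struct example_synic_state *s)
  {
          unsigned int relid;

          /* Channel interrupts first, as vmbus_chan_sched() does. */
          for_each_set_bit(relid, s->pending, EXAMPLE_MAX_CHANNELS) {
                  clear_bit(relid, s->pending);
                  example_dispatch_channel(relid);
          }

          /* Then any message received on the VMBus control path. */
          if (s->control_msg_pending)
                  example_handle_control_msg();
  }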
-The assignment of VMbus channel interrupts to CPUs is done in the +The assignment of VMBus channel interrupts to CPUs is done in the function init_vp_index(). This assignment is done outside of the normal Linux interrupt affinity mechanism, so the interrupts are neither "unmanaged" nor "managed" interrupts. -The CPU that a VMbus channel will interrupt can be seen in +The CPU that a VMBus channel will interrupt can be seen in /sys/bus/vmbus/devices/<deviceGUID>/ channels/<channelRelID>/cpu. When running on later versions of Hyper-V, the CPU can be changed -by writing a new value to this sysfs entry. Because the interrupt -assignment is done outside of the normal Linux affinity mechanism, -there are no entries in /proc/irq corresponding to individual -VMbus channel interrupts. +by writing a new value to this sysfs entry. Because VMBus channel +interrupts are not Linux IRQs, there are no entries in /proc/interrupts +or /proc/irq corresponding to individual VMBus channel interrupts. An online CPU in a Linux guest may not be taken offline if it has -VMbus channel interrupts assigned to it. Any such channel +VMBus channel interrupts assigned to it. Any such channel interrupts must first be manually reassigned to another CPU as described above. When no channel interrupts are assigned to the CPU, it can be taken offline. -When a guest CPU receives a VMbus interrupt from the host, the -function vmbus_isr() handles the interrupt. It first checks for -channel interrupts by calling vmbus_chan_sched(), which looks at a -bitmap setup by the host to determine which channels have pending -interrupts on this CPU. If multiple channels have pending -interrupts for this CPU, they are processed sequentially. When all -channel interrupts have been processed, vmbus_isr() checks for and -processes any message received on the VMbus control path. - -The VMbus channel interrupt handling code is designed to work +The VMBus channel interrupt handling code is designed to work correctly even if an interrupt is received on a CPU other than the CPU assigned to the channel. Specifically, the code does not use CPU-based exclusion for correctness. In normal operation, Hyper-V @@ -242,23 +265,23 @@ when Hyper-V will make the transition. The code must work correctly even if there is a time lag before Hyper-V starts interrupting the new CPU. See comments in target_cpu_store(). -VMbus device creation/deletion +VMBus device creation/deletion ------------------------------ Hyper-V and the Linux guest have a separate message-passing path that is used for synthetic device creation and deletion. This -path does not use a VMbus channel. See vmbus_post_msg() and +path does not use a VMBus channel. See vmbus_post_msg() and vmbus_on_msg_dpc(). The first step is for the guest to connect to the generic -Hyper-V VMbus mechanism. As part of establishing this connection, -the guest and Hyper-V agree on a VMbus protocol version they will +Hyper-V VMBus mechanism. As part of establishing this connection, +the guest and Hyper-V agree on a VMBus protocol version they will use. This negotiation allows newer Linux kernels to run on older Hyper-V versions, and vice versa. The guest then tells Hyper-V to "send offers". Hyper-V sends an offer message to the guest for each synthetic device that the VM -is configured to have. Each VMbus device type has a fixed GUID -known as the "class ID", and each VMbus device instance is also +is configured to have. 
Each VMBus device type has a fixed GUID +known as the "class ID", and each VMBus device instance is also identified by a GUID. The offer message from Hyper-V contains both GUIDs to uniquely (within the VM) identify the device. There is one offer message for each device instance, so a VM with @@ -275,7 +298,7 @@ type based on the class ID, and invokes the correct driver to set up the device. Driver/device matching is performed using the standard Linux mechanism. -The device driver probe function opens the primary VMbus channel to +The device driver probe function opens the primary VMBus channel to the corresponding VSP. It allocates guest memory for the channel ring buffers and shares the ring buffer with the Hyper-V host by giving the host a list of GPAs for the ring buffer memory. See @@ -285,7 +308,7 @@ Once the ring buffer is set up, the device driver and VSP exchange setup messages via the primary channel. These messages may include negotiating the device protocol version to be used between the Linux VSC and the VSP on the Hyper-V host. The setup messages may also -include creating additional VMbus channels, which are somewhat +include creating additional VMBus channels, which are somewhat mis-named as "sub-channels" since they are functionally equivalent to the primary channel once they are created. |
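The offer/probe/open sequence described in this section can be tied together with a skeletal driver sketch. It assumes the VMBus driver API declared in include/linux/hyperv.h (vmbus_open(), vmbus_close(), vmbus_driver_register()) with callback signatures as in recent kernels; the GUID, ring buffer sizes, and driver name are placeholders rather than a real device, and a production driver would also implement suspend and resume callbacks::

  /* Skeletal VMBus device driver sketch; the GUID and sizes are placeholders. */
  #include <linux/hyperv.h>
  #include <linux/module.h>
  #include <linux/uuid.h>

  #define EXAMPLE_RING_BYTES (64 * 1024)  /* illustrative ring buffer size */

  /* Placeholder class ID; a real driver lists the GUIDs of its device class. */
  static const struct hv_vmbus_device_id example_id_table[] = {
          { .guid = GUID_INIT(0x00000000, 0x0000, 0x0000, 0x00, 0x00,
                              0x00, 0x00, 0x00, 0x00, 0x00, 0x00) },
          { },
  };
  MODULE_DEVICE_TABLE(vmbus, example_id_table);

  /* Runs when the "in" ring buffer of the primary channel has new data. */
  static void example_onchannelcallback(void *context)
  {
          struct hv_device *dev = context;

          /* Read packets from dev->channel and validate them here. */
          (void)dev;
  }

  static int example_probe(struct hv_device *dev,
                           const struct hv_vmbus_device_id *dev_id)
  {
          /*
           * Open the primary channel: allocate the ring buffers, share
           * them with the host via a GPADL, and register the channel
           * interrupt callback.
           */
          return vmbus_open(dev->channel, EXAMPLE_RING_BYTES,
                            EXAMPLE_RING_BYTES, NULL, 0,
                            example_onchannelcallback, dev);
  }

  static void example_remove(struct hv_device *dev)
  {
          vmbus_close(dev->channel);
  }

  static struct hv_driver example_drv = {
          .name     = "example_vsc",
          .id_table = example_id_table,
          .probe    = example_probe,
          .remove   = example_remove,
  };

  static int __init example_init(void)
  {
          return vmbus_driver_register(&example_drv);
  }

  static void __exit example_exit(void)
  {
          vmbus_driver_unregister(&example_drv);
  }

  module_init(example_init);
  module_exit(example_exit);
  MODULE_LICENSE("GPL");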