summaryrefslogtreecommitdiff
path: root/Documentation/driver-api/vfio.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/driver-api/vfio.rst')
-rw-r--r--Documentation/driver-api/vfio.rst245
1 files changed, 216 insertions, 29 deletions
diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index f1a4d3c3ba0b..633d11c7fa71 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -2,7 +2,7 @@
VFIO - "Virtual Function I/O" [1]_
==================================
-Many modern system now provide DMA and interrupt remapping facilities
+Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
@@ -239,51 +239,230 @@ group and can access them as follows::
/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);
+IOMMUFD and vfio_iommu_type1
+----------------------------
+
+IOMMUFD is the new user API to manage I/O page tables from userspace.
+It intends to be the portal of delivering advanced userspace DMA
+features (nested translation [5]_, PASID [6]_, etc.) while also providing
+a backwards compatibility interface for existing VFIO_TYPE1v2_IOMMU use
+cases. Eventually the vfio_iommu_type1 driver, as well as the legacy
+vfio container and group model is intended to be deprecated.
+
+The IOMMUFD backwards compatibility interface can be enabled two ways.
+In the first method, the kernel can be configured with
+CONFIG_IOMMUFD_VFIO_CONTAINER, in which case the IOMMUFD subsystem
+transparently provides the entire infrastructure for the VFIO
+container and IOMMU backend interfaces. The compatibility mode can
+also be accessed if the VFIO container interface, ie. /dev/vfio/vfio is
+simply symlink'd to /dev/iommu. Note that at the time of writing, the
+compatibility mode is not entirely feature complete relative to
+VFIO_TYPE1v2_IOMMU (ex. DMA mapping MMIO) and does not attempt to
+provide compatibility to the VFIO_SPAPR_TCE_IOMMU interface. Therefore
+it is not generally advisable at this time to switch from native VFIO
+implementations to the IOMMUFD compatibility interfaces.
+
+Long term, VFIO users should migrate to device access through the cdev
+interface described below, and native access through the IOMMUFD
+provided interfaces.
+
+VFIO Device cdev
+----------------
+
+Traditionally user acquires a device fd via VFIO_GROUP_GET_DEVICE_FD
+in a VFIO group.
+
+With CONFIG_VFIO_DEVICE_CDEV=y the user can now acquire a device fd
+by directly opening a character device /dev/vfio/devices/vfioX where
+"X" is the number allocated uniquely by VFIO for registered devices.
+cdev interface does not support noiommu devices, so user should use
+the legacy group interface if noiommu is wanted.
+
+The cdev only works with IOMMUFD. Both VFIO drivers and applications
+must adapt to the new cdev security model which requires using
+VFIO_DEVICE_BIND_IOMMUFD to claim DMA ownership before starting to
+actually use the device. Once BIND succeeds then a VFIO device can
+be fully accessed by the user.
+
+VFIO device cdev doesn't rely on VFIO group/container/iommu drivers.
+Hence those modules can be fully compiled out in an environment
+where no legacy VFIO application exists.
+
+So far SPAPR does not support IOMMUFD yet. So it cannot support device
+cdev either.
+
+vfio device cdev access is still bound by IOMMU group semantics, ie. there
+can be only one DMA owner for the group. Devices belonging to the same
+group can not be bound to multiple iommufd_ctx or shared between native
+kernel and vfio bus driver or other driver supporting the driver_managed_dma
+flag. A violation of this ownership requirement will fail at the
+VFIO_DEVICE_BIND_IOMMUFD ioctl, which gates full device access.
+
+Device cdev Example
+-------------------
+
+Assume user wants to access PCI device 0000:6a:01.0::
+
+ $ ls /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/
+ vfio0
+
+This device is therefore represented as vfio0. The user can verify
+its existence::
+
+ $ ls -l /dev/vfio/devices/vfio0
+ crw------- 1 root root 511, 0 Feb 16 01:22 /dev/vfio/devices/vfio0
+ $ cat /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/vfio0/dev
+ 511:0
+ $ ls -l /dev/char/511\:0
+ lrwxrwxrwx 1 root root 21 Feb 16 01:22 /dev/char/511:0 -> ../vfio/devices/vfio0
+
+Then provide the user with access to the device if unprivileged
+operation is desired::
+
+ $ chown user:user /dev/vfio/devices/vfio0
+
+Finally the user could get cdev fd by::
+
+ cdev_fd = open("/dev/vfio/devices/vfio0", O_RDWR);
+
+An opened cdev_fd doesn't give the user any permission of accessing
+the device except binding the cdev_fd to an iommufd. After that point
+then the device is fully accessible including attaching it to an
+IOMMUFD IOAS/HWPT to enable userspace DMA::
+
+ struct vfio_device_bind_iommufd bind = {
+ .argsz = sizeof(bind),
+ .flags = 0,
+ };
+ struct iommu_ioas_alloc alloc_data = {
+ .size = sizeof(alloc_data),
+ .flags = 0,
+ };
+ struct vfio_device_attach_iommufd_pt attach_data = {
+ .argsz = sizeof(attach_data),
+ .flags = 0,
+ };
+ struct iommu_ioas_map map = {
+ .size = sizeof(map),
+ .flags = IOMMU_IOAS_MAP_READABLE |
+ IOMMU_IOAS_MAP_WRITEABLE |
+ IOMMU_IOAS_MAP_FIXED_IOVA,
+ .__reserved = 0,
+ };
+
+ iommufd = open("/dev/iommu", O_RDWR);
+
+ bind.iommufd = iommufd;
+ ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+
+ ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
+ attach_data.pt_id = alloc_data.out_ioas_id;
+ ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
+
+ /* Allocate some space and setup a DMA mapping */
+ map.user_va = (int64_t)mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ map.iova = 0; /* 1MB starting at 0x0 from device view */
+ map.length = 1024 * 1024;
+ map.ioas_id = alloc_data.out_ioas_id;;
+
+ ioctl(iommufd, IOMMU_IOAS_MAP, &map);
+
+ /* Other device operations as stated in "VFIO Usage Example" */
+
VFIO User API
-------------------------------------------------------------------------------
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
VFIO bus driver API
-------------------------------------------------------------------------------
VFIO bus drivers, such as vfio-pci make use of only a few interfaces
into VFIO core. When devices are bound and unbound to the driver,
-the driver should call vfio_add_group_dev() and vfio_del_group_dev()
-respectively::
-
- extern int vfio_add_group_dev(struct device *dev,
- const struct vfio_device_ops *ops,
- void *device_data);
-
- extern void *vfio_del_group_dev(struct device *dev);
-
-vfio_add_group_dev() indicates to the core to begin tracking the
-iommu_group of the specified dev and register the dev as owned by
-a VFIO bus driver. The driver provides an ops structure for callbacks
+Following interfaces are called when devices are bound to and
+unbound from the driver::
+
+ int vfio_register_group_dev(struct vfio_device *device);
+ int vfio_register_emulated_iommu_dev(struct vfio_device *device);
+ void vfio_unregister_group_dev(struct vfio_device *device);
+
+The driver should embed the vfio_device in its own structure and use
+vfio_alloc_device() to allocate the structure, and can register
+@init/@release callbacks to manage any private state wrapping the
+vfio_device::
+
+ vfio_alloc_device(dev_struct, member, dev, ops);
+ void vfio_put_device(struct vfio_device *device);
+
+vfio_register_group_dev() indicates to the core to begin tracking the
+iommu_group of the specified dev and register the dev as owned by a VFIO bus
+driver. Once vfio_register_group_dev() returns it is possible for userspace to
+start accessing the driver, thus the driver should ensure it is completely
+ready before calling it. The driver provides an ops structure for callbacks
similar to a file operations structure::
struct vfio_device_ops {
- int (*open)(void *device_data);
- void (*release)(void *device_data);
- ssize_t (*read)(void *device_data, char __user *buf,
+ char *name;
+ int (*init)(struct vfio_device *vdev);
+ void (*release)(struct vfio_device *vdev);
+ int (*bind_iommufd)(struct vfio_device *vdev,
+ struct iommufd_ctx *ictx, u32 *out_device_id);
+ void (*unbind_iommufd)(struct vfio_device *vdev);
+ int (*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
+ void (*detach_ioas)(struct vfio_device *vdev);
+ int (*open_device)(struct vfio_device *vdev);
+ void (*close_device)(struct vfio_device *vdev);
+ ssize_t (*read)(struct vfio_device *vdev, char __user *buf,
size_t count, loff_t *ppos);
- ssize_t (*write)(void *device_data, const char __user *buf,
- size_t size, loff_t *ppos);
- long (*ioctl)(void *device_data, unsigned int cmd,
+ ssize_t (*write)(struct vfio_device *vdev, const char __user *buf,
+ size_t count, loff_t *size);
+ long (*ioctl)(struct vfio_device *vdev, unsigned int cmd,
unsigned long arg);
- int (*mmap)(void *device_data, struct vm_area_struct *vma);
+ int (*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
+ void (*request)(struct vfio_device *vdev, unsigned int count);
+ int (*match)(struct vfio_device *vdev, char *buf);
+ void (*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
+ int (*device_feature)(struct vfio_device *device, u32 flags,
+ void __user *arg, size_t argsz);
};
-Each function is passed the device_data that was originally registered
-in the vfio_add_group_dev() call above. This allows the bus driver
-an easy place to store its opaque, private data. The open/release
-callbacks are issued when a new file descriptor is created for a
-device (via VFIO_GROUP_GET_DEVICE_FD). The ioctl interface provides
-a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
-interfaces implement the device region access defined by the device's
-own VFIO_DEVICE_GET_REGION_INFO ioctl.
+Each function is passed the vdev that was originally registered
+in the vfio_register_group_dev() or vfio_register_emulated_iommu_dev()
+call above. This allows the bus driver to obtain its private data using
+container_of().
+
+::
+
+ - The init/release callbacks are issued when vfio_device is initialized
+ and released.
+
+ - The open/close device callbacks are issued when the first
+ instance of a file descriptor for the device is created (eg.
+ via VFIO_GROUP_GET_DEVICE_FD) for a user session.
+
+ - The ioctl callback provides a direct pass through for some VFIO_DEVICE_*
+ ioctls.
+ - The [un]bind_iommufd callbacks are issued when the device is bound to
+ and unbound from iommufd.
+
+ - The [de]attach_ioas callback is issued when the device is attached to
+ and detached from an IOAS managed by the bound iommufd. However, the
+ attached IOAS can also be automatically detached when the device is
+ unbound from iommufd.
+
+ - The read/write/mmap callbacks implement the device region access defined
+ by the device's own VFIO_DEVICE_GET_REGION_INFO ioctl.
+
+ - The request callback is issued when device is going to be unregistered,
+ such as when trying to unbind the device from the vfio bus driver.
+
+ - The dma_unmap callback is issued when a range of iovas are unmapped
+ in the container or IOAS attached by the device. Drivers which make
+ use of the vfio page pinning interface must implement this callback in
+ order to unpin pages within the dma_unmap range. Drivers must tolerate
+ this callback even before calls to open_device().
PPC64 sPAPR implementation note
-------------------------------
@@ -518,3 +697,11 @@ This implementation has some specifics:
\-0d.1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
+
+.. [5] Nested translation is an IOMMU feature which supports two stage
+ address translations. This improves the address translation efficiency
+ in IOMMU virtualization.
+
+.. [6] PASID stands for Process Address Space ID, introduced by PCI
+ Express. It is a prerequisite for Shared Virtual Addressing (SVA)
+ and Scalable I/O Virtualization (Scalable IOV).