From d424348b060d87f92cc59d8e6ea9c612c5b708f5 Mon Sep 17 00:00:00 2001 From: Dragos Tatulea Date: Thu, 28 Sep 2023 19:45:12 +0300 Subject: vdpa/mlx5: Expose descriptor group mkey hw capability Necessary for improved live migration flow. Actual support will be added in a downstream patch. Reviewed-by: Gal Pressman Signed-off-by: Dragos Tatulea Link: https://lore.kernel.org/r/20230928164550.980832-3-dtatulea@nvidia.com Signed-off-by: Leon Romanovsky --- include/linux/mlx5/mlx5_ifc.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index fc3db401f8a2..3388007c645f 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -1231,7 +1231,13 @@ struct mlx5_ifc_virtio_emulation_cap_bits { u8 max_emulated_devices[0x8]; u8 max_num_virtio_queues[0x18]; - u8 reserved_at_a0[0x60]; + u8 reserved_at_a0[0x20]; + + u8 reserved_at_c0[0x13]; + u8 desc_group_mkey_supported[0x1]; + u8 reserved_at_d4[0xc]; + + u8 reserved_at_e0[0x20]; u8 umem_1_buffer_param_a[0x20]; -- cgit From a72cac6067fdfde280ece0ec5c055659a8de6508 Mon Sep 17 00:00:00 2001 From: Si-Wei Liu Date: Wed, 18 Oct 2023 20:14:41 +0300 Subject: vdpa: introduce dedicated descriptor group for virtqueue MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In some cases, the access to the virtqueue's descriptor area, device and driver areas (precluding indirect descriptor table in guest memory) may have to be confined to a different address space than where its buffers reside. Without loss of simplicity and generality with already established terminology, let's fold up these 3 areas and call them as a whole as descriptor table group, or descriptor group for short. Specifically, in case of split virtqueues, descriptor group consists of regions for Descriptor Table, Available Ring and Used Ring; for packed virtqueues layout, descriptor group contains Descriptor Ring, Driver and Device Event Suppression structures. The group ID for a dedicated descriptor group can be obtained through a new .get_vq_desc_group() op. If driver implements this op, it means that the descriptor, device and driver areas of the virtqueue may reside in a dedicated group than where its buffers reside, a.k.a the default virtqueue group through the .get_vq_group() op. In principle, the descriptor group may or may not have same group ID as the default group. Even if the descriptor group has a different ID, meaning the vq's descriptor group areas can optionally move to a separate address space than where guest memory resides, the descriptor group may still start from a default address space, same as where its buffers reside. To move the descriptor group to a different address space, .set_group_asid() has to be called to change the ASID binding for the group, which is no different than what needs to be done on any other virtqueue group. On the other hand, the .reset() semantics also applies on descriptor table group, meaning the device reset will clear all ASID bindings and move all virtqueue groups including descriptor group back to the default address space, i.e. in ASID 0. QEMU's shadow virtqueue is going to utilize dedicated descriptor group to speed up map and unmap operations, yielding tremendous downtime reduction by avoiding the full and slow remap cycle in SVQ switching. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang Message-Id: <20231018171456.1624030-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin Reviewed-by: Si-Wei Liu Tested-by: Si-Wei Liu Tested-by: Lei Yang --- include/linux/vdpa.h | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'include/linux') diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 0e652026b776..d376309b99cf 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -204,6 +204,16 @@ struct vdpa_map_file { * @vdev: vdpa device * @idx: virtqueue index * Returns u32: group id for this virtqueue + * @get_vq_desc_group: Get the group id for the descriptor table of + * a specific virtqueue (optional) + * @vdev: vdpa device + * @idx: virtqueue index + * Returns u32: group id for the descriptor table + * portion of this virtqueue. Could be different + * than the one from @get_vq_group, in which case + * the access to the descriptor table can be + * confined to a separate asid, isolating from + * the virtqueue's buffer address access. * @get_device_features: Get virtio features supported by the device * @vdev: vdpa device * Returns the virtio features support by the @@ -360,6 +370,7 @@ struct vdpa_config_ops { /* Device ops */ u32 (*get_vq_align)(struct vdpa_device *vdev); u32 (*get_vq_group)(struct vdpa_device *vdev, u16 idx); + u32 (*get_vq_desc_group)(struct vdpa_device *vdev, u16 idx); u64 (*get_device_features)(struct vdpa_device *vdev); u64 (*get_backend_features)(const struct vdpa_device *vdev); int (*set_driver_features)(struct vdpa_device *vdev, u64 features); -- cgit From 03dd63c8fae4599439a30e0dde91061789260dbe Mon Sep 17 00:00:00 2001 From: Dragos Tatulea Date: Wed, 18 Oct 2023 20:14:53 +0300 Subject: vdpa/mlx5: Enable hw support for vq descriptor mapping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Vq descriptor mappings are supported in hardware by filling in an additional mkey which contains the descriptor mappings to the hw vq. A previous patch in this series added support for hw mkey (mr) creation for ASID 1. This patch fills in both the vq data and vq descriptor mkeys based on group ASID mapping. The feature is signaled to the vdpa core through the presence of the .get_vq_desc_group op. Acked-by: Jason Wang Acked-by: Eugenio Pérez Signed-off-by: Dragos Tatulea Message-Id: <20231018171456.1624030-16-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin Reviewed-by: Si-Wei Liu Tested-by: Si-Wei Liu Tested-by: Lei Yang --- include/linux/mlx5/mlx5_ifc_vdpa.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h b/include/linux/mlx5/mlx5_ifc_vdpa.h index 9becdc3fa503..b86d51a855f6 100644 --- a/include/linux/mlx5/mlx5_ifc_vdpa.h +++ b/include/linux/mlx5/mlx5_ifc_vdpa.h @@ -74,7 +74,11 @@ struct mlx5_ifc_virtio_q_bits { u8 reserved_at_320[0x8]; u8 pd[0x18]; - u8 reserved_at_340[0xc0]; + u8 reserved_at_340[0x20]; + + u8 desc_group_mkey[0x20]; + + u8 reserved_at_380[0x80]; }; struct mlx5_ifc_virtio_net_q_object_bits { @@ -141,6 +145,7 @@ enum { MLX5_VIRTQ_MODIFY_MASK_STATE = (u64)1 << 0, MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS = (u64)1 << 3, MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4, + MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY = (u64)1 << 14, }; enum { -- cgit From a5d7df87843dd9025015e4038dd927e893cc8b8b Mon Sep 17 00:00:00 2001 From: Shannon Nelson Date: Mon, 11 Sep 2023 14:31:04 -0700 Subject: virtio: kdoc for struct virtio_pci_modern_device MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Finally following up to Simon's suggestion for some kdoc attention on struct virtio_pci_modern_device. Link: https://lore.kernel.org/netdev/ZE%2FQS0lnUvxFacjf@corigine.com/ Cc: Simon Horman Signed-off-by: Shannon Nelson Acked-by: Eugenio Pérez Message-Id: <20230911213104.14391-1-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang --- include/linux/virtio_pci_modern.h | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) (limited to 'include/linux') diff --git a/include/linux/virtio_pci_modern.h b/include/linux/virtio_pci_modern.h index 067ac1d789bc..a38c729d1973 100644 --- a/include/linux/virtio_pci_modern.h +++ b/include/linux/virtio_pci_modern.h @@ -12,37 +12,47 @@ struct virtio_pci_modern_common_cfg { __le16 queue_reset; /* read-write */ }; +/** + * struct virtio_pci_modern_device - info for modern PCI virtio + * @pci_dev: Ptr to the PCI device struct + * @common: Position of the common capability in the PCI config + * @device: Device-specific data (non-legacy mode) + * @notify_base: Base of vq notifications (non-legacy mode) + * @notify_pa: Physical base of vq notifications + * @isr: Where to read and clear interrupt + * @notify_len: So we can sanity-check accesses + * @device_len: So we can sanity-check accesses + * @notify_map_cap: Capability for when we need to map notifications per-vq + * @notify_offset_multiplier: Multiply queue_notify_off by this value + * (non-legacy mode). + * @modern_bars: Bitmask of BARs + * @id: Device and vendor id + * @device_id_check: Callback defined before vp_modern_probe() to be used to + * verify the PCI device is a vendor's expected device rather + * than the standard virtio PCI device + * Returns the found device id or ERRNO + * @dma_mask: Optional mask instead of the traditional DMA_BIT_MASK(64), + * for vendor devices with DMA space address limitations + */ struct virtio_pci_modern_device { struct pci_dev *pci_dev; struct virtio_pci_common_cfg __iomem *common; - /* Device-specific data (non-legacy mode) */ void __iomem *device; - /* Base of vq notifications (non-legacy mode). */ void __iomem *notify_base; - /* Physical base of vq notifications */ resource_size_t notify_pa; - /* Where to read and clear interrupt */ u8 __iomem *isr; - /* So we can sanity-check accesses. */ size_t notify_len; size_t device_len; - /* Capability for when we need to map notifications per-vq. */ int notify_map_cap; - /* Multiply queue_notify_off by this value. (non-legacy mode). */ u32 notify_offset_multiplier; - int modern_bars; - struct virtio_device_id id; - /* optional check for vendor virtio device, returns dev_id or -ERRNO */ int (*device_id_check)(struct pci_dev *pdev); - - /* optional mask for devices with limited DMA space */ u64 dma_mask; }; -- cgit From e0592acd1ef2497b861ef7ed6eda14b092b1e667 Mon Sep 17 00:00:00 2001 From: Xuan Zhuo Date: Thu, 19 Oct 2023 11:49:02 +0800 Subject: virtio_pci: add check for common cfg size Some buggy devices, the common cfg size may not match the features. This patch checks the common cfg size for the features(VIRTIO_F_NOTIF_CONFIG_DATA, VIRTIO_F_RING_RESET). When the common cfg size does not match the corresponding feature, we fail the probe and print error message. Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Message-Id: <20231019034902.7346-1-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin --- include/linux/virtio_pci_modern.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/virtio_pci_modern.h b/include/linux/virtio_pci_modern.h index a38c729d1973..d0f2797420f7 100644 --- a/include/linux/virtio_pci_modern.h +++ b/include/linux/virtio_pci_modern.h @@ -45,6 +45,7 @@ struct virtio_pci_modern_device { size_t notify_len; size_t device_len; + size_t common_len; int notify_map_cap; -- cgit From d2cf1b6e3b85dcb2bb3e8cd7924beede34fbbf0e Mon Sep 17 00:00:00 2001 From: Si-Wei Liu Date: Sat, 21 Oct 2023 02:25:13 -0700 Subject: vdpa: introduce .reset_map operation callback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Some device specific IOMMU parent drivers have long standing bogus behavior that mistakenly clean up the maps during .reset. By definition, this is violation to the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing inconsistent view between the userspace and the kernel. Some userspace app like QEMU gets around of this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such workaround actually penalize other well-behaved driver setup, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have high cost on pinning. It's imperative to rectify this behavior and remove the problematic code from all those non-compliant parent drivers. The reason why a separate .reset_map op is introduced is because this allows a simple on-chip IOMMU model without exposing too much device implementation detail to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exists other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpacted way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the time of creation. Then the .reset_map op can be used to switch iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because device may work with DMA ops directly by implement their own to manipulate system memory mappings, so don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang Message-Id: <1697880319-4937-2-git-send-email-si-wei.liu@oracle.com> Signed-off-by: Michael S. Tsirkin Tested-by: Lei Yang --- include/linux/vdpa.h | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'include/linux') diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309b99cf..26ae6ae1eac3 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev: Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- cgit From a26f2e4e68ee3130e5d5acb4f58807041aaea905 Mon Sep 17 00:00:00 2001 From: Si-Wei Liu Date: Sat, 21 Oct 2023 02:25:16 -0700 Subject: vdpa: introduce .compat_reset operation callback Some device specific IOMMU parent drivers have long standing bogus behaviour that mistakenly clean up the maps during .reset. By definition, this is violation to the on-chip IOMMU ops (i.e. .set_map, or .dma_map & .dma_unmap) in those offending drivers, as the removal of internal maps is completely agnostic to the upper layer, causing inconsistent view between the userspace and the kernel. Some userspace app like QEMU gets around of this brokenness by proactively removing and adding back all the maps around vdpa device reset, but such workaround actually penaltize other well-behaved driver setup, where vdpa reset always comes with the associated mapping cost, especially for kernel vDPA devices (use_va=false) that have high cost on pinning. It's imperative to rectify this behaviour and remove the problematic code from all those non-compliant parent drivers. However, we cannot unconditionally remove the bogus map-cleaning code from the buggy .reset implementation, as there might exist userspace apps that already rely on the behaviour on some setup. Introduce a .compat_reset driver op to keep compatibility with older userspace. New and well behaved parent driver should not bother to implement such op, but only those drivers that are doing or used to do non-compliant map-cleaning reset will have to. Signed-off-by: Si-Wei Liu Message-Id: <1697880319-4937-5-git-send-email-si-wei.liu@oracle.com> Signed-off-by: Michael S. Tsirkin Tested-by: Lei Yang --- include/linux/vdpa.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'include/linux') diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 26ae6ae1eac3..6b8cbf75712d 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -252,6 +252,17 @@ struct vdpa_map_file { * @reset: Reset device * @vdev: vdpa device * Returns integer: success (0) or error (< 0) + * @compat_reset: Reset device with compatibility quirks to + * accommodate older userspace. Only needed by + * parent driver which used to have bogus reset + * behaviour, and has to maintain such behaviour + * for compatibility with older userspace. + * Historically compliant driver only has to + * implement .reset, Historically non-compliant + * driver should implement both. + * @vdev: vdpa device + * @flags: compatibility quirks for reset + * Returns integer: success (0) or error (< 0) * @suspend: Suspend the device (optional) * @vdev: vdpa device * Returns integer: success (0) or error (< 0) @@ -393,6 +404,8 @@ struct vdpa_config_ops { u8 (*get_status)(struct vdpa_device *vdev); void (*set_status)(struct vdpa_device *vdev, u8 status); int (*reset)(struct vdpa_device *vdev); + int (*compat_reset)(struct vdpa_device *vdev, u32 flags); +#define VDPA_RESET_F_CLEAN_MAP 1 int (*suspend)(struct vdpa_device *vdev); int (*resume)(struct vdpa_device *vdev); size_t (*get_config_size)(struct vdpa_device *vdev); -- cgit From bc91df5c70ac720eca18bd1f4a288f2582713d3e Mon Sep 17 00:00:00 2001 From: Si-Wei Liu Date: Sat, 21 Oct 2023 02:25:17 -0700 Subject: vhost-vdpa: clean iotlb map during reset for older userspace Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't affect change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. [mst: squashed in two fixup commits] Message-Id: <1697880319-4937-6-git-send-email-si-wei.liu@oracle.com> Message-Id: <1698102863-21122-1-git-send-email-si-wei.liu@oracle.com> Reported-by: Dragos Tatulea Tested-by: Dragos Tatulea Message-Id: <1698275594-19204-1-git-send-email-si-wei.liu@oracle.com> Reported-by: Lei Yang Signed-off-by: Si-Wei Liu Signed-off-by: Michael S. Tsirkin Tested-by: Lei Yang --- include/linux/vdpa.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 6b8cbf75712d..db15ac07f8a6 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev) return vdev->dma_dev; } -static inline int vdpa_reset(struct vdpa_device *vdev) +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags) { const struct vdpa_config_ops *ops = vdev->config; int ret; down_write(&vdev->cf_lock); vdev->features_valid = false; - ret = ops->reset(vdev); + if (ops->compat_reset && flags) + ret = ops->compat_reset(vdev, flags); + else + ret = ops->reset(vdev); up_write(&vdev->cf_lock); return ret; } -- cgit