path: root/drivers/iommu/iommufd/main.c
Age  Commit message  Author
2025-07-18  iommufd: Rename some shortterm-related identifiers  (Xu Yilun)
Rename the shortterm-related identifiers to wait-related. The usage of shortterm_users refcount is now beyond its name. It is also used for references which live longer than an ioctl execution. E.g. vdev holds idev's shortterm_users refcount on vdev allocation, releases it during idev's pre_destroy(). Rename the refcount as wait_cnt, since it is always used to sync the referencing & the destruction of the object by waiting for it to go to zero. List all changed identifiers: iommufd_object::shortterm_users -> iommufd_object::wait_cnt REMOVE_WAIT_SHORTTERM -> REMOVE_WAIT iommufd_object_dec_wait_shortterm() -> iommufd_object_dec_wait() zerod_shortterm -> zerod_wait_cnt No functional change intended. Link: https://patch.msgid.link/r/20250716070349.1807226-9-yilun.xu@linux.intel.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-18  iommufd: Destroy vdevice on idevice destroy  (Xu Yilun)
Destroy the iommufd_vdevice (vdev) on iommufd_idevice (idev) destruction so that the vdev can't outlive the idev. The idev represents the physical device bound to iommufd, while the vdev represents the virtual instance of the physical device in the VM. The lifecycle of the vdev should not be longer than that of the idev. This doesn't cause a real problem for existing use cases because the vdev doesn't impact the physical device; it only provides virtualization information. But to extend the vdev for Confidential Computing (CC), there is a need to do secure configuration for the vdev, e.g. TSM Bind/Unbind. These configurations should be rolled back on idev destroy, or the external driver (VFIO) functionality may be impacted. The idev is created by an external driver, so its destruction can't fail. The idev implements a pre_destroy() op to actively remove its associated vdev before destroying itself. There are 3 cases on idev pre_destroy(): 1. The vdev is already destroyed by userspace. No extra handling is needed. 2. The vdev is still alive. Use iommufd_object_tombstone_user() to destroy the vdev and tombstone the vdev ID. 3. The vdev is being destroyed by userspace. The vdev ID is already freed, but the vdev destroy handler has not completed. This requires multi-thread syncing - the vdev holds the idev's short term users reference until vdev destruction completes, and the idev leverages the existing wait_shortterm mechanism for syncing. The idev should also block any new reference to it after pre_destroy(), or the following wait for short term users would time out. Introduce a 'destroying' flag and set it to true on idev pre_destroy(). Any attempt to reference the idev should honor this flag under the protection of idev->igroup->lock. Link: https://patch.msgid.link/r/20250716070349.1807226-5-yilun.xu@linux.intel.com Originally-by: Nicolin Chen <nicolinc@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Co-developed-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> Signed-off-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
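As a rough sketch of the flow this describes (the 'destroying' and 'vdev' fields and the helper wiring are illustrative assumptions, not the actual code):

    /* Hedged sketch: idev pre_destroy() tombstoning its vdev; names illustrative */
    static void iommufd_device_pre_destroy(struct iommufd_object *obj)
    {
            struct iommufd_device *idev =
                    container_of(obj, struct iommufd_device, obj);

            mutex_lock(&idev->igroup->lock);
            /* Block any new reference to the idev; case 3 is resolved by the
             * caller waiting for the short term users count to reach zero. */
            idev->destroying = true;
            if (idev->vdev)
                    /* Case 2: the vdev is still alive, destroy it and tombstone its ID */
                    iommufd_object_tombstone_user(idev->ictx, &idev->vdev->obj);
            mutex_unlock(&idev->igroup->lock);
    }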
2025-07-18  iommufd: Add a pre_destroy() op for objects  (Xu Yilun)
Add a pre_destroy() op which gives objects a chance to clear their short term users references before destruction. This op is intended for objects created by external drivers (e.g. idev) that require deterministic destruction. In order to manage the lifecycle of interrelated objects as well as the deterministic destruction (e.g. the vdev can't outlive the idev, and idev destruction can't fail), short term users references are allowed to live beyond an ioctl execution. An immediate use case is that the vdev holds the idev's short term user reference until vdev destruction completes, and the idev leverages the existing wait_shortterm mechanism to ensure it is destroyed after the vdev. This extended usage means the referenced object can no longer just wait for its references to go away; it needs to actively trigger the reference removal, as well as prevent new references before the wait. This work should be implemented in pre_destroy(). Link: https://patch.msgid.link/r/20250716070349.1807226-4-yilun.xu@linux.intel.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
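A hedged sketch of the idea, with the ops wiring and the call site simplified (the real code differs in detail):

    /* Sketch: an optional callback invoked before waiting for references to drain */
    struct iommufd_object_ops {
            void (*pre_destroy)(struct iommufd_object *obj);
            void (*destroy)(struct iommufd_object *obj);
            void (*abort)(struct iommufd_object *obj);
    };

    /* In the removal path, roughly: */
    if (iommufd_object_ops[obj->type].pre_destroy)
            iommufd_object_ops[obj->type].pre_destroy(obj);
    /* ... then wait for the object's short term users to reach zero ... */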
2025-07-18  iommufd: Add iommufd_object_tombstone_user() helper  (Xu Yilun)
Add the iommufd_object_tombstone_user() helper, which allows the caller to destroy an iommufd object created by userspace. This is useful on some destroy paths when the kernel caller finds the object should have been removed by userspace but is still alive. With this helper, the caller destroys the object but leaves the object ID reserved (a so-called tombstone). The tombstone prevents repurposing the object ID without the awareness of the original user. Since this happens only for abnormal userspace behavior, for simplicity, the tombstoned object ID is permanently leaked until iommufd_fops_release(). I.e. the original user gets an error when calling ioctl(IOMMU_DESTROY) on that ID. The first use case is to ensure the iommufd_vdevice can't outlive the associated iommufd_device. Link: https://patch.msgid.link/r/20250716070349.1807226-3-yilun.xu@linux.intel.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Co-developed-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> Signed-off-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
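For illustration, from the original user's point of view a tombstoned ID simply refuses to be destroyed again; a minimal userspace sketch (the exact errno is not specified by this change):

    struct iommu_destroy destroy = {
            .size = sizeof(destroy),
            .id = vdev_id,               /* an ID the kernel has tombstoned */
    };

    /* Fails until iommufd_fops_release(); the ID slot is never reused */
    if (ioctl(iommufd_fd, IOMMU_DESTROY, &destroy))
            perror("IOMMU_DESTROY on tombstoned ID");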
2025-07-14  iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set  (Nicolin Chen)
An abort op was introduced to allow its caller to invoke it within a lock in the caller's function. On the other hand, _iommufd_object_alloc_ucmd() would invoke the abort op in iommufd_object_abort_and_destroy() that must be outside the caller's lock. So, these two cannot work together. Add a validation in the _iommufd_object_alloc_ucmd(). Pick -EOPNOTSUPP to reject the function call, indicating that the object allocator is buggy. Link: https://patch.msgid.link/r/20250710202354.1658511-1-nicolinc@nvidia.com Suggested-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
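A minimal sketch of the check this adds, assuming the per-type object ops table described in the related commits (names illustrative):

    /* _iommufd_object_alloc_ucmd() cannot be used for types that define abort(),
     * because the deferred abort_and_destroy runs outside the caller's lock. */
    if (WARN_ON(iommufd_object_ops[type].abort))
            return ERR_PTR(-EOPNOTSUPP);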
2025-07-11  iommufd: Add mmap interface  (Nicolin Chen)
For vIOMMU passing through HW resources to user space (VMs), allowing a VM to control the passed through HW directly by accessing hardware registers, add an mmap infrastructure to map the physical MMIO pages to user space. Maintain a maple tree per ictx as a translation table managing mmappable regions, from an allocated for-user mmap offset to an iommufd_mmap struct, where it stores the real physical address range for io_remap_pfn_range(). Keep track of the lifecycle of the mmappable region by taking refcount of its owner, so as to enforce user space to unmap the region first before it can destroy its owner object. To allow an IOMMU driver to add and delete mmappable regions onto/from the maple tree, add iommufd_viommu_alloc/destroy_mmap helpers. Link: https://patch.msgid.link/r/9a888a326b12aa5fe940083eae1156304e210fe0.1752126748.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
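A hedged sketch of what the mmap path looks like under this scheme; the iommufd_mmap field names (mt_mmap, mmio_addr, length) are illustrative assumptions rather than the actual layout:

    /* Sketch: translate a for-user mmap offset via the per-ictx maple tree */
    static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
    {
            struct iommufd_ctx *ictx = filp->private_data;
            size_t length = vma->vm_end - vma->vm_start;
            struct iommufd_mmap *immap;

            immap = mtree_load(&ictx->mt_mmap,
                               (unsigned long)vma->vm_pgoff << PAGE_SHIFT);
            if (!immap || length > immap->length)
                    return -ENXIO;

            /* Map the real physical MMIO range into user space */
            return io_remap_pfn_range(vma, vma->vm_start,
                                      immap->mmio_addr >> PAGE_SHIFT, length,
                                      pgprot_noncached(vma->vm_page_prot));
    }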
2025-07-11  iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl  (Nicolin Chen)
Introduce a new IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl for user space to allocate a HW QUEUE object for a vIOMMU specific HW-accelerated queue, e.g.: - NVIDIA's Virtual Command Queue - AMD vIOMMU's Command Buffer, Event Log Buffers, and PPR Log Buffers Since this is introduced with NVIDIA's VCMDQs that access the guest memory in the physical address space, add an iommufd_hw_queue_alloc_phys() helper that will create an access object to the queue memory in the IOAS, to avoid the mappings of the guest memory from being unmapped, during the life cycle of the HW queue object. AMD's HW will need an hw_queue_init op that is mutually exclusive with the hw_queue_init_phys op, and their case will bypass the access part, i.e. no iommufd_hw_queue_alloc_phys() call. Link: https://patch.msgid.link/r/dab4ace747deb46c1fe70a5c663307f46990ae56.1752126748.git.nicolinc@nvidia.com Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-19  iommufd: Introduce iommufd_object_alloc_ucmd helper  (Nicolin Chen)
An object allocator needs to call either iommufd_object_finalize() upon a success or iommufd_object_abort_and_destroy() upon an error code. To reduce duplication, store a new_obj in the struct iommufd_ucmd and call iommufd_object_finalize/iommufd_object_abort_and_destroy() accordingly in the main function. Similar to iommufd_object_alloc() and __iommufd_object_alloc(), add a pair of helpers: __iommufd_object_alloc_ucmd() and iommufd_object_alloc_ucmd(). Link: https://patch.msgid.link/r/e7206d4227844887cc8dbf0cc7b0242580fafd9d.1749882255.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Pranjal Shrivastava <praan@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
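The dispatch-side pattern this enables looks roughly like the following sketch (the new_obj field name is taken from the description above; the surrounding details are illustrative):

    /* After running an ioctl handler that used iommufd_object_alloc_ucmd() */
    ret = op->execute(&ucmd);
    if (ucmd.new_obj) {
            if (ret)
                    iommufd_object_abort_and_destroy(ictx, ucmd.new_obj);
            else
                    iommufd_object_finalize(ictx, ucmd.new_obj);
    }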
2025-06-19  iommufd: Move _iommufd_object_alloc out of driver.c  (Nicolin Chen)
Now all driver structures are allocated by the core, i.e. there is no longer a need for drivers to call _iommufd_object_alloc. Thus, move it back out of driver.c.
Before:
   text    data     bss     dec     hex filename
   3024     180       0    3204     c84 drivers/iommu/iommufd/driver.o
   9074     610      64    9748    2614 drivers/iommu/iommufd/main.o
After:
   text    data     bss     dec     hex filename
   2665     164       0    2829     b0d drivers/iommu/iommufd/driver.o
   9410     618      64   10092    276c drivers/iommu/iommufd/main.o
Link: https://patch.msgid.link/r/79e630c7b911930cf36e3c8a775a04e66c528d65.1749882255.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-19  iommufd: Apply obvious cosmetic fixes  (Nicolin Chen)
Run clang-format but exclude those not so obvious ones, which leaves us: - Align indentations - Add missing spaces - Remove unnecessary spaces - Remove unnecessary line wrappings Link: https://patch.msgid.link/r/9132e1ab45690ab1959c66bbb51ac5536a635388.1749882255.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-18  iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC  (Nicolin Chen)
Introduce a new IOMMUFD_OBJ_VEVENTQ object for vIOMMU Event Queue that provides user space (VMM) another FD to read the vIOMMU Events. Allow a vIOMMU object to allocate vEVENTQs, with a condition that each vIOMMU can only have one single vEVENTQ per type. Add iommufd_veventq_alloc() with iommufd_veventq_ops for the new ioctl. Link: https://patch.msgid.link/r/21acf0751dd5c93846935ee06f93b9c65eff5e04.1741719725.git.nicolinc@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-27  iommufd: Implement sw_msi support natively  (Jason Gunthorpe)
iommufd has a model where the iommu_domain can be changed while the VFIO device is attached. In this case, the MSI should continue to work. This corner case has not worked because the dma-iommu implementation of sw_msi is tied to a single domain. Implement the sw_msi mapping directly and use a global per-fd table to associate assigned IOVA to the MSI pages. This allows the MSI pages to be loaded into a domain before it is attached ensuring that MSI is not disrupted. Link: https://patch.msgid.link/r/e13d23eeacd67c0a692fc468c85b483f4dd51c57.1740014950.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-01-14  iommufd: Keep OBJ/IOCTL lists in an alphabetical order  (Nicolin Chen)
Reorder the existing OBJ/IOCTL lists. Also run clang-format for the same coding style at line wrappings. No functional change. Link: https://patch.msgid.link/r/c5e6d6e0a0bb7abc92ad26937fde19c9426bee96.1736237481.git.nicolinc@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-01-14  iommu: iommufd: fix WARNING in iommufd_device_unbind  (Suraj Sonawane)
Fix an issue detected by syzbot: WARNING in iommufd_device_unbind iommufd: Time out waiting for iommufd object to become free Resolve a warning in iommufd_device_unbind caused by a timeout while waiting for the shortterm_users reference count to reach zero. The existing 10-second timeout is insufficient in some scenarios, resulting in failures and the above warning. Increase the timeout in iommufd_object_dec_wait_shortterm from 10 seconds to 60 seconds to allow sufficient time for the reference count to drop to zero. This change prevents premature timeouts and reduces the likelihood of warnings during iommufd_device_unbind. Fixes: 6f9c4d8c468c ("iommufd: Do not UAF during iommufd_put_object()") Link: https://patch.msgid.link/r/20241123195900.3176-1-surajsonawane0215@gmail.com Reported-by: syzbot+c92878e123785b1fa2db@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c92878e123785b1fa2db Tested-by: syzbot+c92878e123785b1fa2db@syzkaller.appspotmail.com Signed-off-by: Suraj Sonawane <surajsonawane0215@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-12-02  module: Convert symbol namespace to string literal  (Peter Zijlstra)
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-14  iommufd: Add IOMMU_IOAS_CHANGE_PROCESS  (Steve Sistare)
Add an ioctl that updates all DMA mappings to reflect the current process, changing the mm and transferring locked memory accounting from the old to the current mm. This will be used for live update, allowing an old process to hand the iommufd device descriptor to a new process. The new process calls the ioctl. IOMMU_IOAS_CHANGE_PROCESS only supports DMA mappings created with IOMMU_IOAS_MAP_FILE, because the kernel metadata for such mappings does not depend on the userland VA of the pages (which is different in the new process). IOMMU_IOAS_CHANGE_PROCESS fails if other types of mappings are present. This is a revised version of code originally provided by Jason. Link: https://patch.msgid.link/r/1731527497-16091-4-git-send-email-steven.sistare@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
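A hedged userspace sketch of the intended hand-off, assuming the command structure carries nothing beyond its size (the field layout is not confirmed here):

    /* In the new process, after receiving the iommufd descriptor (e.g. via SCM_RIGHTS) */
    struct iommu_ioas_change_process cmd = { .size = sizeof(cmd) };

    /* Fails if any mapping was not created with IOMMU_IOAS_MAP_FILE */
    if (ioctl(iommufd_fd, IOMMU_IOAS_CHANGE_PROCESS, &cmd))
            perror("IOMMU_IOAS_CHANGE_PROCESS");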
2024-11-14  iommufd: Lock all IOAS objects  (Steve Sistare)
Define helpers to lock and unlock all IOAS objects. This will allow DMA mappings to be updated atomically during live update. This code is substantially the same as an initial version provided by Jason, plus fixes. Link: https://patch.msgid.link/r/1731527497-16091-3-git-send-email-steven.sistare@oracle.com Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-11-12  iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl  (Nicolin Chen)
Introduce a new IOMMUFD_OBJ_VDEVICE to represent a physical device (struct device) against a vIOMMU (struct iommufd_viommu) object in a VM. This vDEVICE object (and its structure) holds all the info and attributes in the VM regarding the device related to the vIOMMU. As an initial patch, add a per-vIOMMU virtual ID. This can be: - Virtual StreamID on a nested ARM SMMUv3, an index to a Stream Table - Virtual DeviceID on a nested AMD IOMMU, an index to a Device Table - Virtual RID on a nested Intel VT-D IOMMU, an index to a Context Table Potentially, this vDEVICE structure would hold some vData for Confidential Compute Architecture (CCA). Use this virtual ID to index a "vdevs" xarray that belongs to a vIOMMU object. Add a new ioctl for vDEVICE allocations. Since a vDEVICE is a connection of a device object and an iommufd_viommu object, take two refcounts in the ioctl handler. Link: https://patch.msgid.link/r/cda8fd2263166e61b8191a3b3207e0d2b08545bf.1730836308.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
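A hedged sketch of the virtual-ID indexing described above (variable and field names illustrative):

    /* Install the vDEVICE at its guest-visible virtual ID in the vIOMMU's xarray */
    curr = xa_cmpxchg(&viommu->vdevs, (unsigned long)virt_id, NULL, vdev,
                      GFP_KERNEL);
    if (curr) {
            rc = xa_err(curr) ?: -EEXIST;   /* virtual ID already in use */
            goto out_put;
    }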
2024-11-12  iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl  (Nicolin Chen)
Add a new ioctl for user space to do a vIOMMU allocation. It must be based on a nesting parent HWPT, so take its refcount. An IOMMU driver wanting to support vIOMMUs must define its IOMMU_VIOMMU_TYPE_ in the uAPI header and implement a viommu_alloc op in its iommu_ops. Link: https://patch.msgid.link/r/dc2b8ba9ac935007beff07c1761c31cd097ed780.1730836219.git.nicolinc@nvidia.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-11-12  iommufd: Verify object in iommufd_object_finalize/abort()  (Nicolin Chen)
To support driver-allocated vIOMMU objects, IOMMU drivers are required to call the provided iommufd_viommu_alloc helper to embed the core struct. However, there is no guarantee that every driver will call it and allocate objects properly. Make the iommufd_object_finalize/abort functions more robust by verifying that the xarray slot indexed by the input obj->id holds an XA_ZERO_ENTRY, which is the reserved value stored by xa_alloc via iommufd_object_alloc. Link: https://patch.msgid.link/r/334bd4dde8e0a88eb30fa67eeef61827cdb546f9.1730836219.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-11-12  iommufd: Move _iommufd_object_alloc helper to a sharable file  (Nicolin Chen)
The following patch will add a new vIOMMU allocator that will require this _iommufd_object_alloc to be sharable with IOMMU drivers (and iommufd too). Add a new driver.c file that will be built with CONFIG_IOMMUFD_DRIVER_CORE, selected by CONFIG_IOMMUFD, and put the CONFIG_DRIVER under that, remaining selectable for drivers to build the existing iova_bitmap.c file. Link: https://patch.msgid.link/r/2f4f6e116dc49ffb67ff6c5e8a7a8e789ab9e98e.1730836219.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-10-28  iommufd: Add IOMMU_IOAS_MAP_FILE  (Steve Sistare)
Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to register memory by passing a memfd plus offset and length. Implement it using the memfd_pin_folios() kAPI. Link: https://patch.msgid.link/r/1729861919-234514-8-git-send-email-steven.sistare@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-26  iommufd: Reorder include files  (Nicolin Chen)
Reorder include files to alphabetic order to simplify maintenance, and separate local headers and global headers with a blank line. No functional change intended. Link: https://patch.msgid.link/r/7524b037cc05afe19db3c18f863253e1d1554fa2.1722644866.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-07-09  iommufd: Add iommufd fault object  (Lu Baolu)
An iommufd fault object provides an interface for delivering I/O page faults to user space. These objects are created and destroyed by user space, and they can be associated with or dissociated from hardware page table objects during page table allocation or destruction. User space interacts with the fault object through a file interface. This interface offers a straightforward and efficient way for user space to handle page faults. It allows user space to read fault messages sequentially and respond to them by writing to the same file. The file interface supports reading messages in poll mode, so it's recommended that user space applications use io_uring to enhance read and write efficiency. A fault object can be associated with any iopf-capable iommufd_hw_pgtable during the pgtable's allocation. All I/O page faults triggered by devices when accessing the I/O addresses of an iommufd_hw_pgtable are routed through the fault object to user space. Similarly, user space's responses to these page faults are routed back to the iommu device driver through the same fault object. Link: https://lore.kernel.org/r/20240702063444.105814-7-baolu.lu@linux.intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
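A simplified sketch of the user space consumption model described above; the structure names come from the uAPI, but field handling and error paths are omitted, and real consumers would batch reads and writes:

    struct pollfd pfd = { .fd = fault_fd, .events = POLLIN };
    struct iommu_hwpt_pgfault fault;
    struct iommu_hwpt_page_response resp;

    while (poll(&pfd, 1, -1) > 0) {
            /* Read the next I/O page fault message delivered by the IOMMU driver */
            if (read(fault_fd, &fault, sizeof(fault)) != sizeof(fault))
                    break;
            /* ... resolve the fault, then fill in the matching response ... */
            if (write(fault_fd, &resp, sizeof(resp)) != sizeof(resp))
                    break;
    }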
2024-01-11  iommufd: Add IOMMU_HWPT_INVALIDATE  (Yi Liu)
In nested translation, the stage-1 page table is user-managed but cached by the IOMMU hardware, so an update on present page table entries in the stage-1 page table should be followed with a cache invalidation. Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation. It takes hwpt_id to specify the iommu_domain, and a multi-entry array to support multiple invalidation data in one ioctl. enum iommu_hwpt_invalidate_data_type is defined to tag the data type of the entries in the multi-entry array. Link: https://lore.kernel.org/r/20240111041015.47920-3-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-11-29  iommufd: Do not UAF during iommufd_put_object()  (Jason Gunthorpe)
The mixture of kernel and user space lifecycle objects continues to be complicated inside iommufd. The obj->destroy_rwsem is used to bring order to the kernel driver destruction sequence but it cannot be sequenced right with the other refcounts so we end up possibly UAF'ing: BUG: KASAN: slab-use-after-free in __up_read+0x627/0x750 kernel/locking/rwsem.c:1342 Read of size 8 at addr ffff888073cde868 by task syz-executor934/6535 CPU: 1 PID: 6535 Comm: syz-executor934 Not tainted 6.6.0-rc7-syzkaller-00195-g2af9b20dbb39 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0xc4/0x620 mm/kasan/report.c:475 kasan_report+0xda/0x110 mm/kasan/report.c:588 __up_read+0x627/0x750 kernel/locking/rwsem.c:1342 iommufd_put_object drivers/iommu/iommufd/iommufd_private.h:149 [inline] iommufd_vfio_ioas+0x46c/0x580 drivers/iommu/iommufd/vfio_compat.c:146 iommufd_fops_ioctl+0x347/0x4d0 drivers/iommu/iommufd/main.c:398 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl fs/ioctl.c:857 [inline] __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd There are two races here, the more obvious one: CPU 0 CPU 1 iommufd_put_object() iommufd_destroy() refcount_dec(&obj->users) iommufd_object_remove() kfree() up_read(&obj->destroy_rwsem) // Boom And there is also perhaps some possibility that the rwsem could hit an issue: CPU 0 CPU 1 iommufd_put_object() iommufd_object_destroy_user() refcount_dec(&obj->users); down_write(&obj->destroy_rwsem) up_read(&obj->destroy_rwsem); atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count); tmp = atomic_long_add_return_release() rwsem_try_write_lock() iommufd_object_remove() up_write(&obj->destroy_rwsem) kfree() clear_nonspinnable() // Boom Fix this by reorganizing this again so that two refcounts are used to keep track of things with a rule that users == 0 && shortterm_users == 0 means no other threads have that memory. Put a wait_queue in the iommufd_ctx object that is triggered when any sub object reaches a 0 shortterm_users. This allows the same wait for userspace ioctls to finish behavior that the rwsem was providing. This is weaker still than the prior versions: - There is no bias on shortterm_users so if some thread is waiting to destroy other threads can continue to get new read sides - If destruction fails, eg because of an active in-kernel user, then shortterm_users will have cycled to zero momentarily blocking new users - If userspace races destroy with other userspace operations they continue to get an EBUSY since we still can't intermix looking up an ID and sleeping for its unref In all cases these are things that userspace brings on itself, correct programs will not hit them. Fixes: 99f98a7c0d69 ("iommufd: IOMMUFD_DESTROY should not increase the refcount") Link: https://lore.kernel.org/all/2-v2-ca9e00171c5b+123-iommufd_syz4_jgg@nvidia.com/ Reported-by: syzbot+d31adfb277377ef8fcba@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/00000000000055ef9a0609336580@google.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
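A hedged sketch of the put path under the new rule, based on the description above (the wait queue field name is an assumption, and the real code may differ):

    static void iommufd_put_object(struct iommufd_ctx *ictx,
                                   struct iommufd_object *obj)
    {
            /*
             * users == 0 && shortterm_users == 0 means no other thread can
             * touch the memory. The final shortterm_users decrement wakes any
             * waiter on the per-ictx wait_queue, e.g. a pending destroy that
             * is blocked on in-flight ioctls.
             */
            refcount_dec(&obj->users);
            if (refcount_dec_and_test(&obj->shortterm_users))
                    wake_up_interruptible_all(&ictx->destroy_wait);
    }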
2023-10-26  iommufd: Add a nested HW pagetable object  (Nicolin Chen)
IOMMU_HWPT_ALLOC already supports iommu_domain allocation for userspace. But it can only allocate a hw_pagetable that associates to a given IOAS, i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING type. IOMMU drivers can now support user-managed hw_pagetables, for two-stage translation use cases that require user data input from the user space. Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper. Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be handled in the callers, for example iommufd_hw_pagetable_enforce_rr(). Screen the inputs including the parent PAGING-type hw_pagetable that has a need of a new nest_parent flag in the iommufd_hwpt_paging structure. Extend the IOMMU_HWPT_ALLOC ioctl to accept an IOMMU driver specific data input which is tagged by the enum iommu_hwpt_data_type. Also, update the @pt_id to accept hwpt_id too besides an ioas_id. Then, use them to allocate a hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type using the iommufd_hw_pagetable_alloc_nested() allocator. Link: https://lore.kernel.org/r/20231026043938.63898-8-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Co-developed-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-26  iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable  (Nicolin Chen)
To prepare for IOMMUFD_OBJ_HWPT_NESTED, derive struct iommufd_hwpt_paging from struct iommufd_hw_pagetable, by leaving the common members in struct iommufd_hw_pagetable. Add a __iommufd_object_alloc and to_hwpt_paging() helpers for the new structure. Then, update "hwpt" to "hwpt_paging" throughout the files, accordingly. Link: https://lore.kernel.org/r/20231026043938.63898-5-yi.l.liu@intel.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-26  iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING  (Jason Gunthorpe)
To add a new IOMMUFD_OBJ_HWPT_NESTED, rename the HWPT object to confine it to PAGING hwpts/domains. The following patch will separate the hwpt structure as well. Link: https://lore.kernel.org/r/20231026043938.63898-3-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24  iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP  (Joao Martins)
Connect a hw_pagetable to the IOMMU core dirty tracking read_and_clear_dirty iommu domain op. It exposes all of the functionality for the UAPI that reads the dirtied IOVAs while clearing the Dirty bits from the PTEs. In doing so, add an IO pagetable API iopt_read_and_clear_dirty_data() that performs the reading of dirty IOPTEs for a given IOVA range and then copies them back to the userspace bitmap. Underneath it uses the IOMMU domain kernel API which will read the dirty bits, as well as atomically clearing the IOPTE dirty bit and flushing the IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the bitmaps user pages efficiently and without copies. Within the iterator function we iterate over io-pagetable contiguous areas that have been mapped. Contrary to a past incarnation of a similar interface in VFIO, the IOVA range to be scanned is tied to the bitmap size, thus the application needs to pass an appropriately sized bitmap address taking into account the iova range being passed *and* page size ... as opposed to allowing bitmap-iova != iova. Link: https://lore.kernel.org/r/20231024135109.73787-8-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24  iommufd: Add IOMMU_HWPT_SET_DIRTY_TRACKING  (Joao Martins)
Every IOMMU driver should be able to implement the needed iommu domain ops to control dirty tracking. Connect a hw_pagetable to the IOMMU core dirty tracking ops, specifically the ability to enable/disable dirty tracking on an IOMMU domain (hw_pagetable id). To that end add an io_pagetable kernel API to toggle dirty tracking: * iopt_set_dirty_tracking(iopt, [domain], state) The intended caller of this is via the hw_pagetable object that is created. Internally it will ensure the leftover dirty state is cleared /right before/ dirty tracking starts. This is also useful for iommu drivers which may decide that dirty tracking is always-enabled at boot without wanting to toggle dynamically via corresponding iommu domain op. Link: https://lore.kernel.org/r/20231024135109.73787-7-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
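A hedged userspace sketch of toggling dirty tracking on a hw_pagetable; the struct layout is assumed from the uAPI naming convention and may differ:

    struct iommu_hwpt_set_dirty_tracking cmd = {
            .size = sizeof(cmd),
            .hwpt_id = hwpt_id,
            .flags = IOMMU_HWPT_DIRTY_TRACKING_ENABLE,   /* clear flags to disable */
    };

    if (ioctl(iommufd_fd, IOMMU_HWPT_SET_DIRTY_TRACKING, &cmd))
            perror("IOMMU_HWPT_SET_DIRTY_TRACKING");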
2023-08-18  iommufd: Add IOMMU_GET_HW_INFO  (Yi Liu)
Under nested IOMMU translation, userspace owns the stage-1 translation table (e.g. the stage-1 page table of Intel VT-d or the context table of ARM SMMUv3, etc.). Stage-1 translation tables are vendor specific, and need to be compatible with the underlying IOMMU hardware. Hence, userspace should know the IOMMU hardware capability before creating and configuring the stage-1 translation table for the kernel. This adds an IOMMU_GET_HW_INFO ioctl to query the IOMMU hardware information (a.k.a. capability) for a given device. The returned data is vendor specific, and userspace needs to decode it into the structure indicated by the output @out_data_type field. As only physical devices have IOMMU hardware, this will return an error if the given device is not a physical device. Link: https://lore.kernel.org/r/20230818101033.4100-4-yi.l.liu@intel.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-08-15  Merge tag 'v6.5-rc6' into iommufd for-next  (Jason Gunthorpe)
Required for following patches. Resolve merge conflict by using the hunk from the for-next branch and shifting the iommufd_object_deref_user() into iommufd_hw_pagetable_put() Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-08-14  iommufd/selftest: Make the mock iommu driver into a real driver  (Jason Gunthorpe)
I've avoided doing this because there is no way to make this happen without an intrusion into the core code. Up till now this has avoided needing the core code's probe path with some hackery - but now that default domains are becoming mandatory it is unavoidable. This became a serious problem when the core code stopped allowing partially registered iommu drivers in commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into iommu_init/deinit_device()") which breaks the selftest. That series was developed along with a second series that contained this patch so it was not noticed. Make it so that the iommufd selftest can create a real iommu driver and bind it only to its own private bus. Add iommu_device_register_bus() as a core code helper to make this possible. It simply sets the right pointers and registers the notifier block. The mock driver then works like any normal driver should, with probe triggered by the bus ops. When the bus->iommu_ops stuff is fully unwound we can probably do better here and remove this special case. Link: https://lore.kernel.org/r/15-v6-e8114faedade+425-iommu_all_defdom_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-27  iommufd: IOMMUFD_DESTROY should not increase the refcount  (Jason Gunthorpe)
syzkaller found a race where IOMMUFD_DESTROY increments the refcount: obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY); if (IS_ERR(obj)) return PTR_ERR(obj); iommufd_ref_to_users(obj); /* See iommufd_ref_to_users() */ if (!iommufd_object_destroy_user(ucmd->ictx, obj)) As part of the sequence to join the two existing primitives together. Allowing the refcount the be elevated without holding the destroy_rwsem violates the assumption that all temporary refcount elevations are protected by destroy_rwsem. Racing IOMMUFD_DESTROY with iommufd_object_destroy_user() will cause spurious failures: WARNING: CPU: 0 PID: 3076 at drivers/iommu/iommufd/device.c:477 iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:478 Modules linked in: CPU: 0 PID: 3076 Comm: syz-executor.0 Not tainted 6.3.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023 RIP: 0010:iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:477 Code: e8 3d 4e 00 00 84 c0 74 01 c3 0f 0b c3 0f 1f 44 00 00 f3 0f 1e fa 48 89 fe 48 8b bf a8 00 00 00 e8 1d 4e 00 00 84 c0 74 01 c3 <0f> 0b c3 0f 1f 44 00 00 41 57 41 56 41 55 4c 8d ae d0 00 00 00 41 RSP: 0018:ffffc90003067e08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888109ea0300 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000ffffffff RBP: 0000000000000004 R08: 0000000000000000 R09: ffff88810bbb3500 R10: ffff88810bbb3e48 R11: 0000000000000000 R12: ffffc90003067e88 R13: ffffc90003067ea8 R14: ffff888101249800 R15: 00000000fffffffe FS: 00007ff7254fe6c0(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555557262da8 CR3: 000000010a6fd000 CR4: 0000000000350ef0 Call Trace: <TASK> iommufd_test_create_access drivers/iommu/iommufd/selftest.c:596 [inline] iommufd_test+0x71c/0xcf0 drivers/iommu/iommufd/selftest.c:813 iommufd_fops_ioctl+0x10f/0x1b0 drivers/iommu/iommufd/main.c:337 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x84/0xc0 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The solution is to not increment the refcount on the IOMMUFD_DESTROY path at all. Instead use the xa_lock to serialize everything. The refcount check == 1 and xa_erase can be done under a single critical region. This avoids the need for any refcount incrementing. It has the downside that if userspace races destroy with other operations it will get an EBUSY instead of waiting, but this is kind of racing is already dangerous. Fixes: 2ff4bed7fee7 ("iommufd: File descriptor, context, kconfig and makefiles") Link: https://lore.kernel.org/r/2-v1-85aacb2af554+bc-iommufd_syz3_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+7574ebfe589049630608@syzkaller.appspotmail.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
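A hedged sketch of the serialized destroy described above, simplified (the real code also covers the in-kernel-user and wait policies):

    xa_lock(&ictx->objects);
    obj = xa_load(&ictx->objects, id);
    /* Only proceed if the caller's lookup would be the sole reference */
    if (!obj || !refcount_dec_if_one(&obj->users)) {
            xa_unlock(&ictx->objects);
            return -EBUSY;
    }
    __xa_erase(&ictx->objects, id);
    xa_unlock(&ictx->objects);

    iommufd_object_ops[obj->type].destroy(obj);
    kfree(obj);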
2023-07-26  iommufd: Add IOMMU_HWPT_ALLOC  (Jason Gunthorpe)
This allows userspace to manually create HWPTs on IOAS's and then use those HWPTs as inputs to iommufd_device_attach/replace(). Following series will extend this to allow creating iommu_domains with driver specific parameters. Link: https://lore.kernel.org/r/17-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-26  iommufd: Make destroy_rwsem use a lock class per object type  (Jason Gunthorpe)
The selftest invokes things like replace under the object lock of its idev which protects the idev in a similar way to a real user. Unfortunately this triggers lockdep. A lock class per type will solve the problem. Link: https://lore.kernel.org/r/15-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-26  iommufd: Add iommufd_device_replace()  (Jason Gunthorpe)
Replace allows all the devices in a group to move in one step to a new HWPT. Further, the HWPT move is done without going through a blocking domain so that the IOMMU driver can implement some level of non-disruption to ongoing DMA if that has meaning for it (e.g. for future special driver domains). Replace uses a lot of the same logic as normal attach, except the actual domain change over has different restrictions, and we are careful to sequence things so that failure is going to leave everything the way it was, and not get trapped in a blocking domain or something if there is ENOMEM. Link: https://lore.kernel.org/r/14-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-26  iommufd: Allow a hwpt to be aborted after allocation  (Jason Gunthorpe)
During creation the hwpt must have the ioas->mutex held until the object is finalized. This means we need to be able to call iommufd_object_abort_and_destroy() while holding the mutex. Since iommufd_hw_pagetable_destroy() also needs the mutex this is problematic. Fix it by creating a special abort op for the object that can assume the caller is holding the lock, as required by the contract. The next patch will add another iommufd_object_abort_and_destroy() for a hwpt. Fixes: e8d57210035b ("iommufd: Add kAPI toward external drivers for physical devices") Link: https://lore.kernel.org/r/10-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-26  iommufd: Add iommufd_group  (Jason Gunthorpe)
When the hwpt to device attachment is fairly static we could get away with the simple approach of keeping track of the groups via a device list. But with replace this is infeasible. Add an automatically managed struct that is 1:1 with the iommu_group per-ictx so we can store the necessary tracking information there. Link: https://lore.kernel.org/r/2-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-25  iommufd: Add iommufd_ctx_from_fd()  (Yi Liu)
It's common to get a reference to the iommufd context from a given file descriptor, so add an API for it. Existing users of this API are compiled only when IOMMUFD is enabled, so there is no need to have a stub for the IOMMUFD disabled case. Tested-by: Yanting Jiang <yanting.jiang@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-21-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25  iommufd: Reserve all negative IDs in the iommufd xarray  (Yi Liu)
With this reservation, IOMMUFD users can encode the negative IDs for specific purposes. e.g. VFIO needs two reserved values to tell userspace the ID returned is not valid but has other meaning. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-4-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-01-23  iommufd: Add three missing structures in ucmd_buffer  (Yi Liu)
struct iommu_ioas_copy, struct iommu_option and struct iommu_vfio_ioas are missing from ucmd_buffer. Although they are smaller than the size of ucmd_buffer, it is safer to list them in ucmd_buffer explicitly. Fixes: aad37e71d5c4 ("iommufd: IOCTLs for the io_pagetable") Fixes: d624d6652a65 ("iommufd: vfio container FD ioctl compatibility") Link: https://lore.kernel.org/r/20230120122040.280219-1-yi.l.liu@intel.com Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02  iommufd: Allow iommufd to supply /dev/vfio/vfio  (Jason Gunthorpe)
If the VFIO container is compiled out, give a kconfig option for iommufd to provide the miscdev node with the same name and permissions as vfio uses. The compatibility node supports the same ioctls as VFIO and automatically enables the VFIO compatible pinned page accounting mode. Link: https://lore.kernel.org/r/10-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-30  iommufd: Add some fault injection points  (Jason Gunthorpe)
This increases the coverage the fail_nth test gets, as well as via syzkaller. Link: https://lore.kernel.org/r/17-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-30  iommufd: Add kernel support for testing iommufd  (Jason Gunthorpe)
Provide a mock kernel module for the iommu_domain that allows it to run without any HW and the mocking provides a way to directly validate that the PFNs loaded into the iommu_domain are correct. This exposes the access kAPI toward userspace to allow userspace to explore the functionality of pages.c and io_pagetable.c The mock also simulates the rare case of PAGE_SIZE > iommu page size as the mock will operate at a 2K iommu page size. This allows exercising all of the calculations to support this mismatch. This is also intended to support syzkaller exploring the same space. However, it is an unusually invasive config option to enable all of this. The config option should not be enabled in a production kernel. Link: https://lore.kernel.org/r/16-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Tested-by: Eric Auger <eric.auger@redhat.com> # aarch64 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-30  iommufd: vfio container FD ioctl compatibility  (Jason Gunthorpe)
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. A userspace application can test against iommufd and confirm compatibility, then simply make a small change to open /dev/iommu instead of /dev/vfio/vfio. For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and then all applications will use the compatibility path with no code changes. A later series allows /dev/vfio/vfio to be directly provided by iommufd, which allows the rlimit mode to work the same as well. This series just provides the iommufd side of compatibility. Actually linking this to VFIO_SET_CONTAINER is a followup series, with a link in the cover letter. Internally the compatibility API uses a normal IOAS object that, like vfio, is automatically allocated when the first device is attached. Userspace can also query or set this IOAS object directly using the IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only features while still using the VFIO style map/unmap ioctls. While this is enough to operate qemu, it has a few differences: - Resource limits rely on memory cgroups to bound what userspace can do instead of the module parameter dma_entry_limit. - VFIO P2P is not implemented. The DMABUF patches for vfio are a start at a solution where iommufd would import a special DMABUF. This is to avoid further propagating the follow_pfn() security problem. - A full audit for pedantic compatibility details (e.g. errnos, etc.) has not yet been done - powerpc SPAPR is left out, as it is not connected to the iommu_domain framework. It seems interest in SPAPR is minimal as it is currently non-working in v6.1-rc1. They will have to convert to the iommu subsystem framework to enjoy iommufd. The following are not going to be implemented and we expect to remove them from VFIO type1: - SW access 'dirty tracking'. As discussed in the cover letter this will be done in VFIO. - VFIO_TYPE1_NESTING_IOMMU https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/ - VFIO_DMA_MAP_FLAG_VADDR https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/ Link: https://lore.kernel.org/r/15-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-30  iommufd: Add kAPI toward external drivers for kernel access  (Jason Gunthorpe)
Kernel access is the mode that VFIO "mdevs" use. In this case there is no struct device and no IOMMU connection. iommufd acts as a record keeper for accesses and returns the actual struct pages back to the caller to use however they need. eg with kmap or the DMA API. Each caller must create a struct iommufd_access with iommufd_access_create(), similar to how iommufd_device_bind() works. Using this struct the caller can access blocks of IOVA using iommufd_access_pin_pages() or iommufd_access_rw(). Callers must provide a callback that immediately unpins any IOVA being used within a range. This happens if userspace unmaps the IOVA under the pin. The implementation forwards the access requests directly to the iopt infrastructure that manages the iopt_pages_access. Link: https://lore.kernel.org/r/14-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
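A hedged sketch of how an emulated-device ("mdev") style driver might use this kAPI; the signatures are approximate and have evolved since this commit:

    static void my_unmap(void *data, unsigned long iova, unsigned long length)
    {
            /* Must immediately stop using, and unpin, any pages in [iova, iova + length) */
    }

    static const struct iommufd_access_ops my_access_ops = {
            .unmap = my_unmap,
    };

    /* Bind time: create the access, similar in spirit to iommufd_device_bind() */
    access = iommufd_access_create(ictx, &my_access_ops, my_data, &access_id);

    /* Pin a range of IOVA and get struct pages back for kmap or the DMA API */
    rc = iommufd_access_pin_pages(access, iova, length, pages,
                                  IOMMUFD_ACCESS_RW_WRITE);

    /* Or copy to/from IOVA without pinning; flags 0 = read, RW_WRITE = write */
    rc = iommufd_access_rw(access, iova, buf, sizeof(buf), 0);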
2022-11-30  iommufd: Add kAPI toward external drivers for physical devices  (Jason Gunthorpe)
Add the four functions external drivers need to connect physical DMA to the IOMMUFD: iommufd_device_bind() / iommufd_device_unbind() Register the device with iommufd and establish security isolation. iommufd_device_attach() / iommufd_device_detach() Connect a bound device to a page table Binding a device creates a device object ID in the uAPI, however the generic API does not yet provide any IOCTLs to manipulate them. Link: https://lore.kernel.org/r/13-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
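A hedged sketch of the bind/attach flow from a physical device driver's perspective; signatures are approximate and have evolved since this commit:

    /* Register the device with iommufd and establish security isolation */
    idev = iommufd_device_bind(ictx, &pdev->dev, &dev_id);
    if (IS_ERR(idev))
            return PTR_ERR(idev);

    /* Connect the bound device to a page table (an IOAS or HWPT ID from userspace) */
    rc = iommufd_device_attach(idev, &pt_id);
    if (rc) {
            iommufd_device_unbind(idev);
            return rc;
    }

    /* ... device performs DMA through the attached page table ... */

    iommufd_device_detach(idev);
    iommufd_device_unbind(idev);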
2022-11-30  iommufd: Add a HW pagetable object  (Jason Gunthorpe)
The hw_pagetable object exposes the internal struct iommu_domain's to userspace. An iommu_domain is required when any DMA device attaches to an IOAS to control the io page table through the iommu driver. For compatibility with VFIO the hw_pagetable is automatically created when a DMA device is attached to the IOAS. If a compatible iommu_domain already exists then the hw_pagetable associated with it is used for the attachment. In the initial series there is no iommufd uAPI for the hw_pagetable object. The next patch provides driver facing APIs for IO page table attachment that allows drivers to accept either an IOAS or a hw_pagetable ID and for the driver to return the hw_pagetable ID that was auto-selected from an IOAS. The expectation is the driver will provide uAPI through its own FD for attaching its device to iommufd. This allows userspace to learn the mapping of devices to iommu_domains and to override the automatic attachment. The future HW specific interface will allow userspace to create hw_pagetable objects using iommu_domains with IOMMU driver specific parameters. This infrastructure will allow linking those domains to IOAS's and devices. Link: https://lore.kernel.org/r/12-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>