summaryrefslogtreecommitdiff
path: root/drivers/vfio/pci
AgeCommit message (Collapse)Author
2024-07-19Merge tag 'vfio-v6.11-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Add support for 8-byte accesses when using read/write through the device regions. This fills a gap for userspace drivers that might not be able to use access through mmap to perform native register width accesses (Gerd Bayer) - Add missing MODULE_DESCRIPTION to vfio-mdev sample drivers and replace a non-standard MODULE_INFO usage (Jeff Johnson) * tag 'vfio-v6.11-rc1' of https://github.com/awilliam/linux-vfio: vfio-mdev: add missing MODULE_DESCRIPTION() macros vfio/pci: Fix typo in macro to declare accessors vfio/pci: Support 8-byte PCI loads and stores vfio/pci: Extract duplicated code into macro
2024-07-10vfio/pci: Init the count variable in collecting hot-reset devicesYi Liu
The count variable is used without initialization, it results in mistakes in the device counting and crashes the userspace if the get hot reset info path is triggered. Fixes: f6944d4a0b87 ("vfio/pci: Collect hot-reset devices to local buffer") Link: https://bugzilla.kernel.org/show_bug.cgi?id=219010 Reported-by: Žilvinas Žaltiena <zaltys@natrix.lt> Cc: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240710004150.319105-1-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-06-21vfio/pci: Support 8-byte PCI loads and storesBen Segal
Many PCI adapters can benefit or even require full 64bit read and write access to their registers. In order to enable work on user-space drivers for these devices add two new variations vfio_pci_core_io{read|write}64 of the existing access methods when the architecture supports 64-bit ioreads and iowrites. Signed-off-by: Ben Segal <bpsegal@us.ibm.com> Co-developed-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Link: https://lore.kernel.org/r/20240619115847.1344875-3-gbayer@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-06-21vfio/pci: Extract duplicated code into macroGerd Bayer
vfio_pci_core_do_io_rw() repeats the same code for multiple access widths. Factor this out into a macro Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Link: https://lore.kernel.org/r/20240619115847.1344875-2-gbayer@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-06-12vfio/pci: Insert full vma on mmap'd MMIO faultAlex Williamson
In order to improve performance of typical scenarios we can try to insert the entire vma on fault. This accelerates typical cases, such as when the MMIO region is DMA mapped by QEMU. The vfio_iommu_type1 driver will fault in the entire DMA mapped range through fixup_user_fault(). In synthetic testing, this improves the time required to walk a PCI BAR mapping from userspace by roughly 1/3rd. This is likely an interim solution until vmf_insert_pfn_{pmd,pud}() gain support for pfnmaps. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Zl6XdUkt%2FzMMGOLF@yzhao56-desk.sh.intel.com/ Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20240607035213.2054226-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-05-31vfio/pci: Use unmap_mapping_range()Alex Williamson
With the vfio device fd tied to the address space of the pseudo fs inode, we can use the mm to track all vmas that might be mmap'ing device BARs, which removes our vma_list and all the complicated lock ordering necessary to manually zap each related vma. Note that we can no longer store the pfn in vm_pgoff if we want to use unmap_mapping_range() to zap a selective portion of the device fd corresponding to BAR mappings. This also converts our mmap fault handler to use vmf_insert_pfn() because we no longer have a vma_list to avoid the concurrency problem with io_remap_pfn_range(). The goal is to eventually use the vm_ops huge_fault handler to avoid the additional faulting overhead, but vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. Also, Jason notes that a race exists between unmap_mapping_range() and the fops mmap callback if we were to call io_remap_pfn_range() to populate the vma on mmap. Specifically, mmap_region() does call_mmap() before it does vma_link_file() which gives a window where the vma is populated but invisible to unmap_mapping_range(). Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240530045236.1005864-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-05-20Merge tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds
Pull vfio updates from Alex Williamson: - The vfio fsl-mc bus driver has become orphaned. We'll consider removing it in future releases if a new maintainer isn't found (Alex Williamson) - Improved usage of opaque data in vfio-pci INTx handling, avoiding lookups of the eventfd through the interrupt and irqfd runtime paths (Alex Williamson) - Resolve an error path memory leak introduced in vfio-pci interrupt code (Ye Bin) - Addition of interrupt support for vfio devices exposed on the CDX bus, including a new MSI allocation helper and export of existing helpers for MSI alloc and free (Nipun Gupta) - A new vfio-pci variant driver supporting migration of Intel QAT VF devices for the GEN4 PFs (Xin Zeng & Yahui Cao) - Resolve a possibly circular locking dependency in vfio-pci by avoiding copy_to_user() from a PCI bus walk callback (Alex Williamson) - Trivial docs update to remove a duplicate semicolon (Foryun Ma) * tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfio: vfio/pci: Restore zero affected bus reset devices warning vfio: remove an extra semicolon vfio/pci: Collect hot-reset devices to local buffer vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices vfio/cdx: add interrupt support genirq/msi: Add MSI allocation helper and export MSI functions vfio/pci: fix potential memory leak in vfio_intx_enable() vfio/pci: Pass eventfd context object through irqfd vfio/pci: Pass eventfd context to IRQ handler MAINTAINERS: Orphan vfio fsl-mc bus driver
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-17vfio/pci: Restore zero affected bus reset devices warningAlex Williamson
Yi notes relative to commit f6944d4a0b87 ("vfio/pci: Collect hot-reset devices to local buffer") that we previously tested the resulting device count with a WARN_ON, which was removed when we switched to the in-loop user copy in commit b56b7aabcf3c ("vfio/pci: Copy hot-reset device info to userspace in the devices loop"). Finding no devices in the bus/slot would be an unexpected condition, so let's restore the warning and trigger a -ERANGE error here as success with no devices would be an unexpected result to userspace as well. Suggested-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20240516174831.2257970-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-05-13VFIO: Add the SPR_DSA and SPR_IAX devices to the denylistArjan van de Ven
Due to an erratum with the SPR_DSA and SPR_IAX devices, it is not secure to assign these devices to virtual machines. Add the PCI IDs of these devices to the VFIO denylist to ensure that this is handled appropriately by the VFIO subsystem. The SPR_DSA and SPR_IAX devices are on-SOC devices for the Sapphire Rapids (and related) family of products that perform data movement and compression. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2024-05-09vfio/pci: Collect hot-reset devices to local bufferAlex Williamson
Lockdep reports the below circular locking dependency issue. The mmap_lock acquisition while holding pci_bus_sem is due to the use of copy_to_user() from within a pci_walk_bus() callback. Building the devices array directly into the user buffer is only for convenience. Instead we can allocate a local buffer for the array, bounded by the number of devices on the bus/slot, fill the device information into this local buffer, then copy it into the user buffer outside the bus walk callback. ====================================================== WARNING: possible circular locking dependency detected 6.9.0-rc5+ #39 Not tainted ------------------------------------------------------ CPU 0/KVM/4113 is trying to acquire lock: ffff99a609ee18a8 (&vdev->vma_lock){+.+.}-{4:4}, at: vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] but task is already holding lock: ffff99a243a052a0 (&mm->mmap_lock){++++}-{4:4}, at: vaddr_get_pfns+0x3f/0x170 [vfio_iommu_type1] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_lock){++++}-{4:4}: __lock_acquire+0x4e4/0xb90 lock_acquire+0xbc/0x2d0 __might_fault+0x5c/0x80 _copy_to_user+0x1e/0x60 vfio_pci_fill_devs+0x9f/0x130 [vfio_pci_core] vfio_pci_walk_wrapper+0x45/0x60 [vfio_pci_core] __pci_walk_bus+0x6b/0xb0 vfio_pci_ioctl_get_pci_hot_reset_info+0x10b/0x1d0 [vfio_pci_core] vfio_pci_core_ioctl+0x1cb/0x400 [vfio_pci_core] vfio_device_fops_unl_ioctl+0x7e/0x140 [vfio] __x64_sys_ioctl+0x8a/0xc0 do_syscall_64+0x8d/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (pci_bus_sem){++++}-{4:4}: __lock_acquire+0x4e4/0xb90 lock_acquire+0xbc/0x2d0 down_read+0x3e/0x160 pci_bridge_wait_for_secondary_bus.part.0+0x33/0x2d0 pci_reset_bus+0xdd/0x160 vfio_pci_dev_set_hot_reset+0x256/0x270 [vfio_pci_core] vfio_pci_ioctl_pci_hot_reset_groups+0x1a3/0x280 [vfio_pci_core] vfio_pci_core_ioctl+0x3b5/0x400 [vfio_pci_core] vfio_device_fops_unl_ioctl+0x7e/0x140 [vfio] __x64_sys_ioctl+0x8a/0xc0 do_syscall_64+0x8d/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&vdev->memory_lock){+.+.}-{4:4}: __lock_acquire+0x4e4/0xb90 lock_acquire+0xbc/0x2d0 down_write+0x3b/0xc0 vfio_pci_zap_and_down_write_memory_lock+0x1c/0x30 [vfio_pci_core] vfio_basic_config_write+0x281/0x340 [vfio_pci_core] vfio_config_do_rw+0x1fa/0x300 [vfio_pci_core] vfio_pci_config_rw+0x75/0xe50 [vfio_pci_core] vfio_pci_rw+0xea/0x1a0 [vfio_pci_core] vfs_write+0xea/0x520 __x64_sys_pwrite64+0x90/0xc0 do_syscall_64+0x8d/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&vdev->vma_lock){+.+.}-{4:4}: check_prev_add+0xeb/0xcc0 validate_chain+0x465/0x530 __lock_acquire+0x4e4/0xb90 lock_acquire+0xbc/0x2d0 __mutex_lock+0x97/0xde0 vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] __do_fault+0x31/0x160 do_pte_missing+0x65/0x3b0 __handle_mm_fault+0x303/0x720 handle_mm_fault+0x10f/0x460 fixup_user_fault+0x7f/0x1f0 follow_fault_pfn+0x66/0x1c0 [vfio_iommu_type1] vaddr_get_pfns+0xf2/0x170 [vfio_iommu_type1] vfio_pin_pages_remote+0x348/0x4e0 [vfio_iommu_type1] vfio_pin_map_dma+0xd2/0x330 [vfio_iommu_type1] vfio_dma_do_map+0x2c0/0x440 [vfio_iommu_type1] vfio_iommu_type1_ioctl+0xc5/0x1d0 [vfio_iommu_type1] __x64_sys_ioctl+0x8a/0xc0 do_syscall_64+0x8d/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &vdev->vma_lock --> pci_bus_sem --> &mm->mmap_lock Possible unsafe locking scenario: block dm-0: the capability attribute has been deprecated. CPU0 CPU1 ---- ---- rlock(&mm->mmap_lock); lock(pci_bus_sem); lock(&mm->mmap_lock); lock(&vdev->vma_lock); *** DEADLOCK *** 2 locks held by CPU 0/KVM/4113: #0: ffff99a25f294888 (&iommu->lock#2){+.+.}-{4:4}, at: vfio_dma_do_map+0x60/0x440 [vfio_iommu_type1] #1: ffff99a243a052a0 (&mm->mmap_lock){++++}-{4:4}, at: vaddr_get_pfns+0x3f/0x170 [vfio_iommu_type1] stack backtrace: CPU: 1 PID: 4113 Comm: CPU 0/KVM Not tainted 6.9.0-rc5+ #39 Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.15.1 06/16/2022 Call Trace: <TASK> dump_stack_lvl+0x64/0xa0 check_noncircular+0x131/0x150 check_prev_add+0xeb/0xcc0 ? add_chain_cache+0x10a/0x2f0 ? __lock_acquire+0x4e4/0xb90 validate_chain+0x465/0x530 __lock_acquire+0x4e4/0xb90 lock_acquire+0xbc/0x2d0 ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] ? lock_is_held_type+0x9a/0x110 __mutex_lock+0x97/0xde0 ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] ? lock_acquire+0xbc/0x2d0 ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] ? find_held_lock+0x2b/0x80 ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core] __do_fault+0x31/0x160 do_pte_missing+0x65/0x3b0 __handle_mm_fault+0x303/0x720 handle_mm_fault+0x10f/0x460 fixup_user_fault+0x7f/0x1f0 follow_fault_pfn+0x66/0x1c0 [vfio_iommu_type1] vaddr_get_pfns+0xf2/0x170 [vfio_iommu_type1] vfio_pin_pages_remote+0x348/0x4e0 [vfio_iommu_type1] vfio_pin_map_dma+0xd2/0x330 [vfio_iommu_type1] vfio_dma_do_map+0x2c0/0x440 [vfio_iommu_type1] vfio_iommu_type1_ioctl+0xc5/0x1d0 [vfio_iommu_type1] __x64_sys_ioctl+0x8a/0xc0 do_syscall_64+0x8d/0x170 ? rcu_core+0x8d/0x250 ? __lock_release+0x5e/0x160 ? rcu_core+0x8d/0x250 ? lock_release+0x5f/0x120 ? sched_clock+0xc/0x30 ? sched_clock_cpu+0xb/0x190 ? irqtime_account_irq+0x40/0xc0 ? __local_bh_enable+0x54/0x60 ? __do_softirq+0x315/0x3ca ? lockdep_hardirqs_on_prepare.part.0+0x97/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f8300d0357b Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 68 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007f82ef3fb948 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8300d0357b RDX: 00007f82ef3fb990 RSI: 0000000000003b71 RDI: 0000000000000023 RBP: 00007f82ef3fb9c0 R08: 0000000000000000 R09: 0000561b7e0bcac2 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000200000000 R14: 0000381800000000 R15: 0000000000000000 </TASK> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240503143138.3562116-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-04-29vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devicesXin Zeng
Add vfio pci variant driver for Intel QAT SR-IOV VF devices. This driver registers to the vfio subsystem through the interfaces exposed by the subsystem. It follows the live migration protocol v2 defined in uapi/linux/vfio.h and interacts with Intel QAT PF driver through a set of interfaces defined in qat/qat_mig_dev.h to support live migration of Intel QAT VF devices. This version only covers migration for Intel QAT GEN4 VF devices. Co-developed-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Xin Zeng <xin.zeng@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240426064051.2859652-1-xin.zeng@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-04-25fix missing vmalloc.h includesKent Overstreet
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-22vfio/pci: fix potential memory leak in vfio_intx_enable()Ye Bin
If vfio_irq_ctx_alloc() failed will lead to 'name' memory leak. Fixes: 18c198c96a81 ("vfio/pci: Create persistent INTx handler") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240415015029.3699844-1-yebin10@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-04-22vfio/pci: Pass eventfd context object through irqfdAlex Williamson
Further avoid lookup of the context object by passing it through the irqfd data field. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240401195406.3720453-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-04-22vfio/pci: Pass eventfd context to IRQ handlerAlex Williamson
Create a link back to the vfio_pci_core_device on the eventfd context object to avoid lookups in the interrupt path. The context is known valid in the interrupt handler. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240401195406.3720453-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pci: Create persistent INTx handlerAlex Williamson
A vulnerability exists where the eventfd for INTx signaling can be deconfigured, which unregisters the IRQ handler but still allows eventfds to be signaled with a NULL context through the SET_IRQS ioctl or through unmask irqfd if the device interrupt is pending. Ideally this could be solved with some additional locking; the igate mutex serializes the ioctl and config space accesses, and the interrupt handler is unregistered relative to the trigger, but the irqfd path runs asynchronous to those. The igate mutex cannot be acquired from the atomic context of the eventfd wake function. Disabling the irqfd relative to the eventfd registration is potentially incompatible with existing userspace. As a result, the solution implemented here moves configuration of the INTx interrupt handler to track the lifetime of the INTx context object and irq_type configuration, rather than registration of a particular trigger eventfd. Synchronization is added between the ioctl path and eventfd_signal() wrapper such that the eventfd trigger can be dynamically updated relative to in-flight interrupts or irqfd callbacks. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-5-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pci: Lock external INTx masking opsAlex Williamson
Mask operations through config space changes to DisINTx may race INTx configuration changes via ioctl. Create wrappers that add locking for paths outside of the core interrupt code. In particular, irq_type is updated holding igate, therefore testing is_intx() requires holding igate. For example clearing DisINTx from config space can otherwise race changes of the interrupt configuration. This aligns interfaces which may trigger the INTx eventfd into two camps, one side serialized by igate and the other only enabled while INTx is configured. A subsequent patch introduces synchronization for the latter flows. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pci: Disable auto-enable of exclusive INTx IRQAlex Williamson
Currently for devices requiring masking at the irqchip for INTx, ie. devices without DisINTx support, the IRQ is enabled in request_irq() and subsequently disabled as necessary to align with the masked status flag. This presents a window where the interrupt could fire between these events, resulting in the IRQ incrementing the disable depth twice. This would be unrecoverable for a user since the masked flag prevents nested enables through vfio. Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx is never auto-enabled, then unmask as required. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pds: Refactor/simplify reset logicBrett Creeley
The current logic for handling resets is more complicated than it needs to be. The deferred_reset flag is used to indicate a reset is needed and the deferred_reset_state is the requested, post-reset, state. Also, the deferred_reset logic was added to vfio migration drivers to prevent a circular locking dependency with respect to mm_lock and state mutex. This is mainly because of the copy_to/from_user() functions(which takes mm_lock) invoked under state mutex. Remove all of the deferred reset logic and just pass the requested next state to pds_vfio_reset() so it can be used for VMM and DSC initiated resets. This removes the need for pds_vfio_state_mutex_lock(), so remove that and replace its use with a simple mutex_unlock(). Also, remove the reset_mutex as it's no longer needed since the state_mutex can be the driver's primary protector. Suggested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20240308182149.22036-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pds: Make sure migration file isn't accessed after resetBrett Creeley
It's possible the migration file is accessed after reset when it has been cleaned up, especially when it's initiated by the device. This is because the driver doesn't rip out the filep when cleaning up it only frees the related page structures and sets its local struct pds_vfio_lm_file pointer to NULL. This can cause a NULL pointer dereference, which is shown in the example below during a restore after a device initiated reset: BUG: kernel NULL pointer dereference, address: 000000000000000c PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:pds_vfio_get_file_page+0x5d/0xf0 [pds_vfio_pci] [...] Call Trace: <TASK> pds_vfio_restore_write+0xf6/0x160 [pds_vfio_pci] vfs_write+0xc9/0x3f0 ? __fget_light+0xc9/0x110 ksys_write+0xb5/0xf0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] Add a disabled flag to the driver's struct pds_vfio_lm_file that gets set during cleanup. Then make sure to check the flag when the migration file is accessed via its file_operations. By default this flag will be false as the memory for struct pds_vfio_lm_file is kzalloc'd, which means the struct pds_vfio_lm_file is enabled and accessible. Also, since the file_operations and driver's migration file cleanup happen under the protection of the same pds_vfio_lm_file.lock, using this flag is thread safe. Fixes: 8512ed256334 ("vfio/pds: Always clear the save/restore FDs on reset") Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20240308182149.22036-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/mlx5: Enforce PRE_COPY supportYishai Hadas
Enable live migration only once the firmware supports PRE_COPY. PRE_COPY has been supported by the firmware for a long time already [1] and is required to achieve a low downtime upon live migration. This lets us clean up some old code that is not applicable those days while PRE_COPY is fully supported by the firmware. [1] The minimum firmware version that supports PRE_COPY is 28.36.1010, it was released in January 2023. No firmware without PRE_COPY support ever available to users. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20240306105624.114830-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-05hisi_acc_vfio_pci: Remove the deferred_reset logicShameer Kolothum
The deferred_reset logic was added to vfio migration drivers to prevent a circular locking dependency with respect to mm_lock and state mutex. This is mainly because of the copy_to/from_user() functions(which takes mm_lock) invoked under state mutex. But for HiSilicon driver, the only place where we now hold the state mutex for copy_to_user is during the PRE_COPY IOCTL. So for pre_copy, release the lock as soon as we have updated the data and perform copy_to_user without state mutex. By this, we can get rid of the deferred_reset logic. Link: https://lore.kernel.org/kvm/20240220132459.GM13330@nvidia.com/ Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240229091152.56664-1-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-04vfio/nvgrace-gpu: Convey kvm to map device memory region as noncachedAnkit Agrawal
The NVIDIA Grace Hopper GPUs have device memory that is supposed to be used as a regular RAM. It is accessible through CPU-GPU chip-to-chip cache coherent interconnect and is present in the system physical address space. The device memory is split into two regions - termed as usemem and resmem - in the system physical address space, with each region mapped and exposed to the VM as a separate fake device BAR [1]. Owing to a hardware defect for Multi-Instance GPU (MIG) feature [2], there is a requirement - as a workaround - for the resmem BAR to display uncached memory characteristics. Based on [3], on system with FWB enabled such as Grace Hopper, the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host mapping (S2) of MT_S2_FWB_NORMAL_NC. KVM currently maps the MMIO region in S2 as MT_S2_FWB_DEVICE_nGnRE by default. The fake device BARs thus displays DEVICE_nGnRE behavior in the VM. The following table summarizes the behavior for the various S1 and S2 mapping combinations for systems with FWB enabled [3]. S1 | S2 | Result NORMAL_NC | NORMAL_NC | NORMAL_NC NORMAL_NC | DEVICE_nGnRE | DEVICE_nGnRE Recently a change was added that modifies this default behavior and make KVM map MMIO as MT_S2_FWB_NORMAL_NC when a VMA flag VM_ALLOW_ANY_UNCACHED is set [4]. Setting S2 as MT_S2_FWB_NORMAL_NC provides the desired behavior (uncached, unaligned access) for resmem. To use VM_ALLOW_ANY_UNCACHED flag, the platform must guarantee that no action taken on the MMIO mapping can trigger an uncontained failure. The Grace Hopper satisfies this requirement. So set the VM_ALLOW_ANY_UNCACHED flag in the VMA. Applied over next-20240227. base-commit: 22ba90670a51 Link: https://lore.kernel.org/all/20240220115055.23546-4-ankita@nvidia.com/ [1] Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [2] Link: https://developer.arm.com/documentation/ddi0487/latest/ section D8.5.5 [3] Link: https://lore.kernel.org/all/20240224150546.368-1-ankita@nvidia.com/ [4] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Vikram Sethi <vsethi@nvidia.com> Cc: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240229193934.2417-1-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-04Merge branch 'kvm-arm64/vfio-normal-nc' of ↵Alex Williamson
https://git.kernel.org/pub/scm/linux/kernel/git/oupton/linux into v6.9/vfio/next
2024-03-01vfio/pds: Always clear the save/restore FDs on resetBrett Creeley
After reset the VFIO device state will always be put in VFIO_DEVICE_STATE_RUNNING, but the save/restore files will only be cleared if the previous state was VFIO_DEVICE_STATE_ERROR. This can/will cause the restore/save files to be leaked if/when the migration state machine transitions through the states that re-allocates these files. Fix this by always clearing the restore/save files for resets. Fixes: 7dabb1bcd177 ("vfio/pds: Add support for firmware recovery") Cc: stable@vger.kernel.org Signed-off-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240228003205.47311-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-24vfio: Convey kvm that the vfio-pci device is wc safeAnkit Agrawal
The VM_ALLOW_ANY_UNCACHED flag is implemented for ARM64, allowing KVM stage 2 device mapping attributes to use Normal-NC rather than DEVICE_nGnRE, which allows guest mappings supporting write-combining attributes (WC). ARM does not architecturally guarantee this is safe, and indeed some MMIO regions like the GICv2 VCPU interface can trigger uncontained faults if Normal-NC is used. To safely use VFIO in KVM the platform must guarantee full safety in the guest where no action taken against a MMIO mapping can trigger an uncontained failure. The expectation is that most VFIO PCI platforms support this for both mapping types, at least in common flows, based on some expectations of how PCI IP is integrated. So make vfio-pci set the VM_ALLOW_ANY_UNCACHED flag. Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240224150546.368-5-ankita@nvidia.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-22vfio/nvgrace-gpu: Add vfio pci variant module for grace hopperAnkit Agrawal
NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. This device memory aperture can be obtained from host ACPI table using device_property_read_u64(), according to the FW specification. Since the device memory is cache coherent with the CPU, it can be mmap into the user VMA with a cacheable mapping using remap_pfn_range() and used like a regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a minimum reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken in view of the fact that it would necessiate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising of region 2 and 3). The cached device memory aperture is assigned BAR region 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as BAR2 region, the cacheable as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesis ~0 reads and dropped writes on such access. So there is no need to support the disablement/enablement of BAR through PCI_COMMAND config space register. The memory layout on the host looks like the following: devmem (memlength) |--------------------------------------------------| |-------------cached------------------------|--NC--| | | usemem.memphys resmem.memphys PCI BARs need to be aligned to the power-of-2, but the actual memory on the device may not. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and VFIO driver. The VM device driver make use of the same value for its calculation to determine USEMEM size. Currently there is no provision in KVM for a S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped pgprot_writecombine(), that sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [3] and [4], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the device memory properties are not present, the driver registers the vfio-pci-core function pointers. Since there are no ACPI memory properties generated for the VM, the variant driver inside the VM will only use the vfio-pci-core ops and hence try to map the BARs as non cached. This is not a problem as the CPUs have FWB enabled which blocks the VM mapping's ability to override the cacheability set by the host mapping. This goes along with a qemu series [6] to provides the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM. [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/ [3] https://lore.kernel.org/all/20240211174705.31992-1-ankita@nvidia.com/ [4] https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/ [5] https://github.com/NVIDIA/open-gpu-kernel-modules [6] https://lore.kernel.org/all/20231203060245.31593-1-ankita@nvidia.com/ Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Aniket Agashe <aniketa@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240220115055.23546-4-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/pci: rename and export range_intersect_rangeAnkit Agrawal
range_intersect_range determines an overlap between two ranges. If an overlap, the helper function returns the overlapping offset and size. The VFIO PCI variant driver emulates the PCI config space BAR offset registers. These offset may be accessed for read/write with a variety of lengths including sub-word sizes from sub-word offsets. The driver makes use of this helper function to read/write the targeted part of the emulated register. Make this a vfio_pci_core function, rename and export as GPL. Also update references in virtio driver. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240220115055.23546-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/pci: rename and export do_io_rw()Ankit Agrawal
do_io_rw() is used to read/write to the device MMIO. The grace hopper VFIO PCI variant driver require this functionality to read/write to its memory. Rename this as vfio_pci_core functions and export as GPL. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240220115055.23546-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNINGYishai Hadas
Let firmware knows upon leaving PRE_COPY back to RUNNING as of some error in the target/migration cancellation. This will let firmware cleaning its internal resources that were turned on upon PRE_COPY. The flow is based on the device specification in this area. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/mlx5: Block incremental query upon migf state errorYishai Hadas
Block incremental query which is state-dependent once the migration file was previously marked with state error. This may prevent redundant calls to firmware upon PRE_COPY which will end-up with a failure and a syndrome printed in dmesg. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/mlx5: Handle the EREMOTEIO error upon the SAVE commandYishai Hadas
The SAVE command uses the async command interface over the PF. Upon a failure in the firmware -EREMOTEIO is returned. In that case call mlx5_cmd_out_err() to let it print the command failure details including the firmware syndrome. Note: The other commands in the driver use the sync command interface in a way that a firmware syndrome is printed upon an error inside mlx5_core. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/mlx5: Add support for tracker object change eventYishai Hadas
Add support for tracker object change event by referring to its MLX5_EVENT_TYPE_OBJECT_CHANGE event when occurs. This lets the driver recognize whether the firmware moved the tracker object to an error state. In that case, the driver will skip/block any usage of that object including an early exit in case the object was previously marked with an error. This functionality also covers the case when no CQE is delivered as of the error state. The driver was adapted to the device specification to handle the above. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/pci: WARN_ON driver_override kasprintf failureKunwu Chan
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. This is a blocking notifier callback, so errno isn't a proper return value. Use WARN_ON to small allocation failures. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240115063434.20278-1-chentao@kylinos.cn Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-01-18Merge tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Add debugfs support, initially used for reporting device migration state (Longfang Liu) - Fixes and support for migration dirty tracking across multiple IOVA regions in the pds-vfio-pci driver (Brett Creeley) - Improved IOMMU allocation accounting visibility (Pasha Tatashin) - Virtio infrastructure and a new virtio-vfio-pci variant driver, which provides emulation of a legacy virtio interfaces on modern virtio hardware for virtio-net VF devices where the PF driver exposes support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV VF to provide driver ABI compatibility to legacy devices (Yishai Hadas & Feng Liu) - Migration fixes for the hisi-acc-vfio-pci variant driver (Shameer Kolothum) - Kconfig dependency fix for new virtio-vfio-pci variant driver (Arnd Bergmann) * tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio: (22 commits) vfio/virtio: fix virtio-pci dependency hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume vfio/virtio: Declare virtiovf_pci_aer_reset_done() static vfio/virtio: Introduce a vfio driver over virtio devices vfio/pci: Expose vfio_pci_core_iowrite/read##size() vfio/pci: Expose vfio_pci_core_setup_barmap() virtio-pci: Introduce APIs to execute legacy IO admin commands virtio-pci: Initialize the supported admin commands virtio-pci: Introduce admin commands virtio-pci: Introduce admin command sending function virtio-pci: Introduce admin virtqueue virtio: Define feature bit for administration virtqueue vfio/type1: account iommu allocations vfio/pds: Add multi-region support vfio/pds: Move seq/ack bitmaps into region struct vfio/pds: Pass region info to relevant functions vfio/pds: Move and rename region specific info vfio/pds: Only use a single SGL for both seq and ack vfio/pds: Fix calculations in pds_vfio_dirty_sync MAINTAINERS: Add vfio debugfs interface doc link ...
2024-01-10vfio/virtio: fix virtio-pci dependencyArnd Bergmann
The new vfio-virtio driver already has a dependency on VIRTIO_PCI_ADMIN_LEGACY, but that is a bool symbol and allows vfio-virtio to be built-in even if virtio-pci itself is a loadable module. This leads to a link failure: aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_probe': main.c:(.text+0xec): undefined reference to `virtio_pci_admin_has_legacy_io' aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_init_device': main.c:(.text+0x260): undefined reference to `virtio_pci_admin_legacy_io_notify_info' aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_bar0_rw': main.c:(.text+0x6ec): undefined reference to `virtio_pci_admin_legacy_common_io_read' aarch64-linux-ld: main.c:(.text+0x6f4): undefined reference to `virtio_pci_admin_legacy_device_io_read' aarch64-linux-ld: main.c:(.text+0x7f0): undefined reference to `virtio_pci_admin_legacy_common_io_write' aarch64-linux-ld: main.c:(.text+0x7f8): undefined reference to `virtio_pci_admin_legacy_device_io_write' Add another explicit dependency on the tristate symbol. Fixes: eb61eca0e8c3 ("vfio/virtio: Introduce a vfio driver over virtio devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240109075731.2726731-1-arnd@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2024-01-05hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resumeShameer Kolothum
When the optional PRE_COPY support was added to speed up the device compatibility check, it failed to update the saving/resuming data pointers based on the fd offset. This results in migration data corruption and when the device gets started on the destination the following error is reported in some cases, [ 478.907684] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received: [ 478.913691] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000310200000010 [ 478.919603] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000002088000007f [ 478.925515] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000000000000000 [ 478.931425] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000000000000000 [ 478.947552] hisi_zip 0000:31:00.0: qm_axi_rresp [error status=0x1] found [ 478.955930] hisi_zip 0000:31:00.0: qm_db_timeout [error status=0x400] found [ 478.955944] hisi_zip 0000:31:00.0: qm sq doorbell timeout in function 2 Fixes: d9a871e4a143 ("hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions") Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231120091406.780-1-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-20vfio/virtio: Declare virtiovf_pci_aer_reset_done() staticYishai Hadas
Declare virtiovf_pci_aer_reset_done() as a static function to prevent the below build warning. "warning: no previous prototype for 'virtiovf_pci_aer_reset_done' [-Wmissing-prototypes]" Fixes: eb61eca0e8c3 ("vfio/virtio: Introduce a vfio driver over virtio devices") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20231220143122.63337669@canb.auug.org.au/ Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231220082456.241973-1-yishaih@nvidia.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312202115.oDmvN1VE-lkp@intel.com/ Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19Merge branch 'v6.8/vfio/virtio' into v6.8/vfio/nextAlex Williamson
2023-12-19vfio/virtio: Introduce a vfio driver over virtio devicesYishai Hadas
Introduce a vfio driver over virtio devices to support the legacy interface functionality for VFs. Background, from the virtio spec [1]. -------------------------------------------------------------------- In some systems, there is a need to support a virtio legacy driver with a device that does not directly support the legacy interface. In such scenarios, a group owner device can provide the legacy interface functionality for the group member devices. The driver of the owner device can then access the legacy interface of a member device on behalf of the legacy member device driver. For example, with the SR-IOV group type, group members (VFs) can not present the legacy interface in an I/O BAR in BAR0 as expected by the legacy pci driver. If the legacy driver is running inside a virtual machine, the hypervisor executing the virtual machine can present a virtual device with an I/O BAR in BAR0. The hypervisor intercepts the legacy driver accesses to this I/O BAR and forwards them to the group owner device (PF) using group administration commands. -------------------------------------------------------------------- Specifically, this driver adds support for a virtio-net VF to be exposed as a transitional device to a guest driver and allows the legacy IO BAR functionality on top. This allows a VM which uses a legacy virtio-net driver in the guest to work transparently over a VF which its driver in the host is that new driver. The driver can be extended easily to support some other types of virtio devices (e.g virtio-blk), by adding in a few places the specific type properties as was done for virtio-net. For now, only the virtio-net use case was tested and as such we introduce the support only for such a device. Practically, Upon probing a VF for a virtio-net device, in case its PF supports legacy access over the virtio admin commands and the VF doesn't have BAR 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a transitional device with I/O BAR in BAR 0. The existence of the simulated I/O bar is reported later on by overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device exposes itself as a transitional device by overwriting some properties upon reading its config space. Once we report the existence of I/O BAR as BAR 0 a legacy driver in the guest may use it via read/write calls according to the virtio specification. Any read/write towards the control parts of the BAR will be captured by the new driver and will be translated into admin commands towards the device. In addition, any data path read/write access (i.e. virtio driver notifications) will be captured by the driver and forwarded to the physical BAR which its properties were supplied by the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the probing/init flow. With that code in place a legacy driver in the guest has the look and feel as if having a transitional device with legacy support for both its control and data path flows. [1] https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19vfio/pci: Expose vfio_pci_core_iowrite/read##size()Yishai Hadas
Expose vfio_pci_core_iowrite/read##size() to let it be used by drivers. This functionality is needed to enable direct access to some physical BAR of the device with the proper locks/checks in place. The next patches from this series will use this functionality on a data path flow when a direct access to the BAR is needed. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19vfio/pci: Expose vfio_pci_core_setup_barmap()Yishai Hadas
Expose vfio_pci_core_setup_barmap() to be used by drivers. This will let drivers to mmap a BAR and re-use it from both vfio and the driver when it's applicable. This API will be used in the next patches by the vfio/virtio coming driver. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04vfio/pds: Add multi-region supportBrett Creeley
Only supporting a single region/range is limiting, wasteful, and in some cases broken (i.e. when there are large gaps in the iova memory ranges). Fix this by adding support for multiple regions based on what the device tells the driver it can support. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-7-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04vfio/pds: Move seq/ack bitmaps into region structBrett Creeley
Since the host seq/ack bitmaps are part of a region move them into struct pds_vfio_region. Also, make use of the bmp_bytes value for validation purposes. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-6-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04vfio/pds: Pass region info to relevant functionsBrett Creeley
A later patch in the series implements multi-region support. That will require specific regions to be passed to relevant functions. Prepare for that change by passing the region structure to relevant functions. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-5-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04vfio/pds: Move and rename region specific infoBrett Creeley
An upcoming change in this series will add support for multiple regions. To prepare for that, move region specific information into struct pds_vfio_region and rename the members for readability. This will reduce the size of the patch that actually implements multiple region support. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-4-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04vfio/pds: Only use a single SGL for both seq and ackBrett Creeley
Since the seq/ack operations never happen in parallel there is no need for multiple scatter gather lists per region. The current implementation is wasting memory. Fix this by only using a single scatter-gather list for both the seq and ack operations. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04vfio/pds: Fix calculations in pds_vfio_dirty_syncBrett Creeley
The incorrect check is being done for comparing the iova/length being requested to sync. This can cause the dirty sync operation to fail. Fix this by making sure the iova offset added to the requested sync length doesn't exceed the region_size. Also, the region_start is assumed to always be at 0. This can cause dirty tracking to fail because the device/driver bitmap offset always starts at 0, however, the region_start/iova may not. Fix this by determining the iova offset from region_start to determine the bitmap offset. Fixes: f232836a9152 ("vfio/pds: Add support for dirty page tracking") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>