|
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"This broadly brings the assigned HW command queue support to iommufd.
This feature is used to improve SVA performance in VMs by avoiding
paravirtualization traps during SVA invalidations.
Along the way I think some of the core logic is in a much better state
to support future driver backed features.
Summary:
- IOMMU HW now has features to directly assign HW command queues to a
guest VM. In this mode the command queue operates on a limited set
of invalidation commands that are suitable for improving guest
invalidation performance and easy for the HW to virtualize.
This brings the generic infrastructure to allow IOMMU drivers to
expose such command queues through the iommufd uAPI, mmap the
doorbell pages, and get the guest physical range for the command
queue ring itself.
- An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built
on the new iommufd command queue features. It works with the
existing SMMU driver support for cmdqv in guest VMs.
- Many precursor cleanups and improvements to support the above
cleanly, changes to the general ioctl and object helpers, driver
support for VDEVICE, and mmap pgoff cookie infrastructure.
- Sequence VDEVICE destruction to always happen before VFIO device
destruction. When using the above type features, and also in future
confidential compute, the internal virtual device representation
becomes linked to HW or CC TSM configuration and objects. If a VFIO
device is removed from iommufd those HW objects should also be
cleaned up to prevent a sort of UAF. This became important now that
we have HW backing the VDEVICE.
- Fix one syzkaller found error related to math overflows during iova
allocation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (57 commits)
iommu/arm-smmu-v3: Replace vsmmu_size/type with get_viommu_size
iommu/arm-smmu-v3: Do not bother impl_ops if IOMMU_VIOMMU_TYPE_ARM_SMMUV3
iommufd: Rename some shortterm-related identifiers
iommufd/selftest: Add coverage for vdevice tombstone
iommufd/selftest: Explicitly skip tests for inapplicable variant
iommufd/vdevice: Remove struct device reference from struct vdevice
iommufd: Destroy vdevice on idevice destroy
iommufd: Add a pre_destroy() op for objects
iommufd: Add iommufd_object_tombstone_user() helper
iommufd/viommu: Roll back to use iommufd_object_alloc() for vdevice
iommufd/selftest: Test reserved regions near ULONG_MAX
iommufd: Prevent ALIGN() overflow
iommu/tegra241-cmdqv: import IOMMUFD module namespace
iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
...
|
|
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on
hardware that previously advertised it unconditionally
- Map supporting endpoints with cacheable memory attributes on
systems with FEAT_S2FWB and DIC where KVM no longer needs to
perform cache maintenance on the address range
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
guest hypervisor to inject external aborts into an L2 VM and take
traps of masked external aborts to the hypervisor
- Convert more system register sanitization to the config-driven
implementation
- Fixes to the visibility of EL2 registers, namely making VGICv3
system registers accessible through the VGIC device instead of the
ONE_REG vCPU ioctls
- Various cleanups and minor fixes
LoongArch:
- Add stat information for in-kernel irqchip
- Add tracepoints for CPUCFG and CSR emulation exits
- Enhance in-kernel irqchip emulation
- Various cleanups
RISC-V:
- Enable ring-based dirty memory tracking
- Improve perf kvm stat to report interrupt events
- Delegate illegal instruction trap to VS-mode
- MMU improvements related to upcoming nested virtualization
s390x:
- Fixes
x86:
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
APIC, PIC, and PIT emulation at compile time
- Share device posted IRQ code between SVM and VMX and harden it
against bugs and runtime errors
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
O(1) instead of O(n)
- For MMIO stale data mitigation, track whether or not a vCPU has
access to (host) MMIO based on whether the page tables have MMIO
pfns mapped; using VFIO is prone to false negatives
- Rework the MSR interception code so that the SVM and VMX APIs are
more or less identical
- Recalculate all MSR intercepts from scratch on MSR filter changes,
instead of maintaining shadow bitmaps
- Advertise support for LKGS (Load Kernel GS base), a new instruction
that's loosely related to FRED, but is supported and enumerated
independently
- Fix a user-triggerable WARN that syzkaller found by setting the
vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
possible path leading to architecturally forbidden states is hard
and even risks breaking userspace (if it goes from valid to valid
state but passes through invalid states), so just wait until
KVM_RUN to detect that the vCPU state isn't allowed
- Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
interception of APERF/MPERF reads, so that a "properly" configured
VM can access APERF/MPERF. This has many caveats (APERF/MPERF
cannot be zeroed on vCPU creation or saved/restored on suspend and
resume, or preserved over thread migration let alone VM migration)
but can be useful whenever you're interested in letting Linux
guests see the effective physical CPU frequency in /proc/cpuinfo
- Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
created, as there's no known use case for changing the default
frequency for other VM types and it goes counter to the very reason
why the ioctl was added to the vm file descriptor. And also, there
would be no way to make it work for confidential VMs with a
"secure" TSC, so kill two birds with one stone
- Dynamically allocate the shadow MMU's hashed page list, and defer
the allocation until it's actually needed (the TDP MMU doesn't use
the list)
- Extract many of KVM's helpers for accessing architectural local
APIC state to common x86 so that they can be shared by guest-side
code for Secure AVIC
- Various cleanups and fixes
x86 (Intel):
- Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
Failure to honor FREEZE_IN_SMM can leak host state into guests
- Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
prevent L1 from running L2 with features that KVM doesn't support,
e.g. BTF
x86 (AMD):
- WARN and reject loading kvm-amd.ko instead of panicking the kernel
if the nested SVM MSRPM offsets tracker can't handle an MSR (which
is pretty much a static condition and therefore should never
happen, but still)
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code
- Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation
- Extend enable_ipiv module param support to SVM, by simply leaving
IsRunning clear in the vCPU's physical ID table entry
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected
by erratum #1235, to allow (safely) enabling AVIC on such CPUs
- Request GA Log interrupts if and only if the target vCPU is
blocking, i.e. only if KVM needs a notification in order to wake
the vCPU
- Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
the vCPU's CPUID model
- Accept any SNP policy that is accepted by the firmware with respect
to SMT and single-socket restrictions. An incompatible policy
doesn't put the kernel at risk in any way, so there's no reason for
KVM to care
- Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
use WBNOINVD instead of WBINVD when possible for SEV cache
maintenance
- When reclaiming memory from an SEV guest, only do cache flushes on
CPUs that have ever run a vCPU for the guest, i.e. don't flush the
caches for CPUs that can't possibly have cache lines with dirty,
encrypted data
Generic:
- Rework irqbypass to track/match producers and consumers via an
xarray instead of a linked list. Using a linked list leads to
O(n^2) insertion times, which is hugely problematic for use cases
that create large numbers of VMs. Such use cases typically don't
actually use irqbypass, but eliminating the pointless registration
is a future problem to solve as it likely requires new uAPI
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
"void *", to avoid making a simple concept unnecessarily difficult
to understand
- Decouple device posted IRQs from VFIO device assignment, as binding
a VM to a VFIO group is not a requirement for enabling device
posted IRQs
- Clean up and document/comment the irqfd assignment code
- Disallow binding multiple irqfds to an eventfd with a priority
waiter, i.e. ensure an eventfd is bound to at most one irqfd
through the entire host, and add a selftest to verify eventfd:irqfd
bindings are globally unique
- Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
related to private <=> shared memory conversions
- Drop guest_memfd's .getattr() implementation as the VFS layer will
call generic_fillattr() if inode_operations.getattr is NULL
- Fix issues with dirty ring harvesting where KVM doesn't bound the
processing of entries in any way, which allows userspace to keep
KVM in a tight loop indefinitely
- Kill off kvm_arch_{start,end}_assignment() and x86's associated
tracking, now that KVM no longer uses assigned_device_count as a
heuristic for either irqbypass usage or MDS mitigation
Selftests:
- Fix a comment typo
- Verify KVM is loaded when getting any KVM module param so that
attempting to run a selftest without kvm.ko loaded results in a
SKIP message about KVM not being loaded/enabled (versus some random
parameter not existing)
- Skip tests that hit EACCES when attempting to access a file, and
print a "Root required?" help message. In most cases, the test just
needs to be run with elevated permissions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
Documentation: KVM: Use unordered list for pre-init VGIC registers
RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
RISC-V: perf/kvm: Add reporting of interrupt events
RISC-V: KVM: Enable ring-based dirty memory tracking
RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
RISC-V: KVM: Delegate illegal instruction fault to VS mode
RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
RISC-V: KVM: Factor-out g-stage page table management
RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
RISC-V: KVM: Introduce struct kvm_gstage_mapping
RISC-V: KVM: Factor-out MMU related declarations into separate headers
RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
RISC-V: KVM: Don't flush TLB when PTE is unchanged
RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
...
|
|
Merge tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
"Core:
- Remove the 'pgsize_bitmap' member from 'struct iommu_ops'
- Convert the x86 drivers over to msi_create_parent_irq_domain()
AMD-Vi:
- Add support for examining driver/device internals via debugfs
- Add support for "HATDis" to disable host translation when it is not
supported
- Add support for limiting the maximum host translation level based
on EFR[HATS]
Apple DART:
- Don't enable as built-in by default when ARCH_APPLE is selected
Arm SMMU:
- Devicetree bindings update for the Qualcomm SMMU in the "Milos" SoC
- Support for Qualcomm SM6115 MDSS parts
- Disable PRR on Qualcomm SM8250 as using these bits causes the
hypervisor to explode
Intel VT-d:
- Reorganize Intel VT-d to be ready for iommupt
- Optimize iotlb_sync_map for non-caching/non-RWBF modes
- Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()
Mediatek:
- Fix build warnings when W=1
Samsung Exynos:
- Add support for reserved memory regions specified by the bootloader
TI OMAP:
- Use syscon_regmap_lookup_by_phandle_args() instead of parsing the
node manually
Misc:
- Cleanups and minor fixes across the board"
* tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (48 commits)
iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
dt-bindings: arm-smmu: Remove sdm845-cheza specific entry
iommu/amd: Fix geometry.aperture_end for V2 tables
iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
iommu/amd: Add documentation for AMD IOMMU debugfs support
iommu/amd: Add debugfs support to dump IRT Table
iommu/amd: Add debugfs support to dump device table
iommu/amd: Add support for device id user input
iommu/amd: Add debugfs support to dump IOMMU command buffer
iommu/amd: Add debugfs support to dump IOMMU Capability registers
iommu/amd: Add debugfs support to dump IOMMU MMIO registers
iommu/amd: Refactor AMD IOMMU debugfs initial setup
dt-bindings: arm-smmu: document the support on Milos
iommu/exynos: add support for reserved regions
iommu/arm-smmu: disable PRR on SM8250
iommu/arm-smmu-v3: Revert vmaster in the error path
iommu/io-pgtable-arm: Remove unused macro iopte_prot
iommu/arm-smmu-qcom: Add SM6115 MDSS compatible
iommu/qcom: Fix pgsize_bitmap
...
|
|
KVM IRQ changes for 6.17
- Rework irqbypass to track/match producers and consumers via an xarray
instead of a linked list. Using a linked list leads to O(n^2) insertion
times, which is hugely problematic for use cases that create large numbers
of VMs. Such use cases typically don't actually use irqbypass, but
eliminating the pointless registration is a future problem to solve as it
likely requires new uAPI.
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
to avoid making a simple concept unnecessarily difficult to understand.
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
and PIT emulation at compile time.
- Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
- Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation.
- Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
clear in the vCPU's physical ID table entry.
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
erratum #1235, to allow (safely) enabling AVIC on such CPUs.
- Dedup x86's device posted IRQ code, as the vast majority of functionality
can be shared verbatim between SVM and VMX.
- Harden the device posted IRQ code against bugs and runtime errors.
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
instead of O(n).
- Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
only if KVM needs a notification in order to wake the vCPU.
- Decouple device posted IRQs from VFIO device assignment, as binding a VM to
a VFIO group is not a requirement for enabling device posted IRQs.
- Clean up and document/comment the irqfd assignment code.
- Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
ensure an eventfd is bound to at most one irqfd through the entire host,
and add a selftest to verify eventfd:irqfd bindings are globally unique.
|
|
* intel/vt-d:
iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
iommu/vt-d: Split paging_domain_compatible()
iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
iommu/vt-d: Create unique domain ops for each stage
iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
iommu/vt-d: Do not wipe out the page table NID when devices detach
iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
|
|
Commit 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach
path") disables IOPF on device by removing the device from its IOMMU's
IOPF queue when the last IOPF-capable domain is detached from the device.
Unfortunately, it did this in the wrong place, where there are still pending
IOPFs. As a result, a use-after-free error is potentially triggered and
eventually a kernel panic with a kernel trace similar to the following:
refcount_t: underflow; use-after-free.
WARNING: CPU: 3 PID: 313 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
Workqueue: iopf_queue/dmar0-iopfq iommu_sva_handle_iopf
Call Trace:
<TASK>
iopf_free_group+0xe/0x20
process_one_work+0x197/0x3d0
worker_thread+0x23a/0x350
? rescuer_thread+0x4a0/0x4a0
kthread+0xf8/0x230
? finish_task_switch.isra.0+0x81/0x260
? kthreads_online_cpu+0x110/0x110
? kthreads_online_cpu+0x110/0x110
ret_from_fork+0x13b/0x170
? kthreads_online_cpu+0x110/0x110
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---
The intel_pasid_tear_down_entry() function is responsible for blocking
hardware from generating new page faults and flushing all in-flight
ones. Therefore, moving iopf_for_domain_remove() after this function
should resolve this.
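For illustration, a minimal sketch of the corrected ordering on the
teardown path (both functions are named in the text above; the
surrounding context is elided):
  /* Block new page faults and flush all in-flight ones first ... */
  intel_pasid_tear_down_entry(iommu, dev, pasid, false);

  /* ... only then remove the device's iopf bookkeeping. */
  iopf_for_domain_remove(domain, dev);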
Fixes: 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach path")
Reported-by: Ethan Milon <ethan.milon@eviden.com>
Closes: https://lore.kernel.org/r/e8b37f3e-8539-40d4-8993-43a1f3ffe5aa@eviden.com
Suggested-by: Ethan Milon <ethan.milon@eviden.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250723072045.1853328-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 12724ce3fe1a ("iommu/vt-d: Optimize iotlb_sync_map for
non-caching/non-RWBF modes") dynamically set iotlb_sync_map. This causes
synchronization issues due to lack of locking on map and attach paths,
racing iommufd userspace operations.
Invalidation changes must precede device attachment to ensure all flushes
complete before hardware walks page tables, preventing coherence issues.
Make domain->iotlb_sync_map static, set once during domain allocation. If
an IOMMU requires iotlb_sync_map but the domain lacks it, attach is
rejected. This won't reduce domain sharing: RWBF and shadowing page table
caching are legacy uses with legacy hardware. Mixed configs (some IOMMUs
in caching mode, others not) are unlikely in real-world scenarios.
Fixes: 12724ce3fe1a ("iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250721051657.1695788-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The logic in cache_tag_flush_all() to iterate over cache tags and issue
TLB invalidations is largely duplicated in cache_tag_flush_range(), with
the only difference being the range parameters.
Extend cache_tag_flush_range() to handle a full address space flush when
called with start = 0 and end = ULONG_MAX. This allows
cache_tag_flush_all() to simply delegate to cache_tag_flush_range().
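The resulting delegation is essentially a one-liner; a sketch, assuming
cache_tag_flush_range() takes the domain plus a start/end pair (and an
invalidation-hint argument, here 0):
  void cache_tag_flush_all(struct dmar_domain *domain)
  {
          /* start = 0, end = ULONG_MAX selects the full-space flush */
          cache_tag_flush_range(domain, 0, ULONG_MAX, 0);
  }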
Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-2-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-12-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The function cache_tag_flush_all() was originally implemented with
incorrect device TLB invalidation logic that does not handle PASID, in
commit c4d27ffaa8eb ("iommu/vt-d: Add cache tag invalidation helpers").
This causes regressions where full address space TLB invalidations occur
with a PASID attached, such as during transparent hugepage unmapping in
SVA configurations or when calling iommu_flush_iotlb_all(). In these
cases, the device receives a TLB invalidation that lacks PASID.
This incorrect logic was later extracted into
cache_tag_flush_devtlb_all(), in commit 3297d047cd7f ("iommu/vt-d:
Refactor IOTLB and Dev-IOTLB flush for batching").
The fix replaces the call to cache_tag_flush_devtlb_all() with
cache_tag_flush_devtlb_psi(), which properly handles PASID.
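As a sketch of the substitution (the helper names come from the text
above; the argument lists are illustrative, not copied from the patch):
  /* before: full device TLB flush that drops the PASID */
  cache_tag_flush_devtlb_all(domain, tag);

  /* after: PASID-aware flush covering the whole address space */
  cache_tag_flush_devtlb_psi(domain, tag, 0, ULONG_MAX);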
Fixes: 4f609dbff51b ("iommu/vt-d: Use cache helpers in arch_invalidate_secondary_tlbs")
Fixes: 4e589a53685c ("iommu/vt-d: Use cache_tag_flush_all() in flush_iotlb_all")
Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-1-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-11-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Make first/second stage specific functions that follow the same pattern in
intel_iommu_domain_alloc_first/second_stage() for computing
EOPNOTSUPP. This makes the code easier to understand: if we couldn't
create a domain with the given parameters for this IOMMU instance, then we
are certainly not compatible with it.
Check superpage support directly against the per-stage cap bits and the
pgsize_bitmap.
Add a note that the force_snooping is read without locking. The locking
needs to cover the compatible check and the add of the device to the list.
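A sketch of what one such per-stage helper might look like, assuming
VT-d's existing capability macros (the function name and exact checks
are illustrative):
  static int paging_domain_compatible_first_stage(struct dmar_domain *domain,
                                                  struct intel_iommu *iommu)
  {
          /* First stage requires scalable mode with first-level support */
          if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))
                  return -EOPNOTSUPP;

          /* Superpage support checked directly against per-stage cap bits */
          if ((domain->domain.pgsize_bitmap & SZ_1G) &&
              !cap_fl1gp_support(iommu->cap))
                  return -EOPNOTSUPP;

          return 0;
  }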
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-10-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
First Stage and Second Stage have very different ways to deny
no-snoop. The first stage uses the PGSNP bit which is global per-PASID so
enabling requires loading new PASID entries for all the attached devices.
Second stage uses a bit per PTE, so enabling just requires telling future
maps to set the bit.
Since we now have two domain ops we can have two functions that can
directly code their required actions instead of a bunch of logic dancing
around use_first_level.
Combine domain_set_force_snooping() into the new functions since they are
its only callers.
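For instance, the second-stage variant can reduce to little more than
setting the flag, since only future PTEs need the snoop bit (a sketch;
the function name is an assumption, and any locking is elided):
  static bool intel_iommu_enforce_cache_coherency_ss(struct iommu_domain *domain)
  {
          struct dmar_domain *dmar_domain = to_dmar_domain(domain);

          /* Tell future maps to set the per-PTE snoop bit */
          dmar_domain->force_snooping = true;
          return true;
  }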
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-9-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Use the domain ops pointer to tell what kind of domain it is instead of
the internal use_first_level indication. This also protects against
wrongly using a SVA/nested/IDENTITY/BLOCKED domain type in places they
should not be.
The only remaining uses of use_first_level outside the paging domain are in
paging_domain_compatible() and intel_iommu_enforce_cache_coherency().
Thus, remove the useless sets of use_first_level in
intel_svm_domain_alloc() and intel_iommu_domain_alloc_nested(). None of
the unique ops for these domain types ever reference it on their call
chains.
Add a WARN_ON() check in domain_context_mapping_one() as it only works
with second stage.
This is preparation for iommupt which will have different ops for each of
the stages.
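A sketch of the guard, assuming the per-stage op tables introduced by
the "unique domain ops" change are named along the lines of
intel_ss_paging_domain_ops:
  /* domain_context_mapping_one() only works with second stage */
  if (WARN_ON(domain->domain.ops != &intel_ss_paging_domain_ops))
          return -EINVAL;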
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-8-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Create stage specific functions that check the stage specific conditions
to decide if each stage can be supported.
Have intel_iommu_domain_alloc_paging_flags() call both stages in sequence
until one does not return EOPNOTSUPP and prefer to use the first stage if
available and suitable for the requested flags.
Move second stage only operations like nested_parent and dirty_tracking
into the second stage function for clarity.
Move initialization of the iommu_domain members into paging_domain_alloc().
Drop initialization of domain->owner as the callers all do it.
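A sketch of the sequencing described, using the current
domain_alloc_paging_flags() op signature (the per-stage helper
signatures are illustrative):
  static struct iommu_domain *
  intel_iommu_domain_alloc_paging_flags(struct device *dev, u32 flags,
                                        const struct iommu_user_data *user_data)
  {
          struct iommu_domain *domain;

          /* Prefer first stage if available and suitable for the flags */
          domain = intel_iommu_domain_alloc_first_stage(dev, flags);
          if (domain != ERR_PTR(-EOPNOTSUPP))
                  return domain;

          return intel_iommu_domain_alloc_second_stage(dev, flags);
  }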
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-7-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The NID controls which NUMA node the page table's memory is allocated
from. It should be a permanent property of the page table, fixed when the
table is allocated, and should not change during attach/detach of devices.
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Fixes: 7c204426b818 ("iommu/vt-d: Add domain_alloc_paging support")
Link: https://lore.kernel.org/r/20250714045028.958850-6-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
It has only one caller; there is no need for two functions.
Correct the WARN_ON() error handling to leak the entire page table if the
HW is still referencing it so we don't UAF during WARN_ON recovery.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-5-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Pass the phys_addr_t down through the call chain from the top instead of
passing a pgd_t * KVA. This moves the __pa() into
domain_setup_first_level() which is the first function to obtain the pgd
from the IOMMU page table in this call chain.
The SVA flow is also adjusted to get the pa of the mm->pgd.
iommupt will move the __pa() into the iommupt code; it never shares the
KVA of the page table with the driver.
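A fragment illustrating where the conversion now happens (the variable
name is an assumption):
  /* domain_setup_first_level() is the first function in the chain to
   * obtain the pgd, so the KVA-to-physical conversion happens here,
   * exactly once: */
  phys_addr_t fsptptr = __pa(domain->pgd);

  /* Everything below this point sees only the physical address. */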
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-4-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The iotlb_sync_map iommu ops allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel iommu driver,
this callback specifically serves two purposes:
- To flush caches when a second-stage page table is attached to a device
whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
resident remapping structures are visible to hardware (CAP_REG.RWBF==1).
However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.
Despite being a no-op, cache_tag_flush_range_np() involves iterating
through all cache tags of the IOMMUs attached to the domain, protected
by a spinlock. This unnecessary execution path introduces overhead,
leading to a measurable I/O performance regression. On systems with
NVMe drives under the same bridge, performance was observed to drop
from ~6150 MiB/s down to ~4985 MiB/s.
Introduce a flag in the dmar_domain structure. This flag will only be set
when iotlb_sync_map is required (i.e., when CM or RWBF is set). The
cache_tag_flush_range_np() is called only for domains where this flag is
set. This flag, once set, is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. Theoretically, the
immutability of this flag does not impact functionality.
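A sketch of the allocation-time predicate implied by the description
(the helper name is an assumption; the capability macros are VT-d's
existing ones):
  static bool domain_need_iotlb_sync_map(struct intel_iommu *iommu)
  {
          /* CM: the TLB may cache non-present entries; RWBF: the write
           * buffer must be flushed before HW sees memory-resident
           * remapping structure updates. */
          return cap_caching_mode(iommu->cap) || cap_rwbf(iommu->cap);
  }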
Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes: 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250714045028.958850-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The iommu init hook is wrapped in CONFIG_X86 and is a remnant of the days
when dmar.c was common code in "drivers/pci/dmar.c". The wrapping was
added in commit 9d5ce73a64be2 ("x86: intel-iommu: Convert
detect_intel_iommu to use iommu_init hook").
Now this is built only for x86, so the config wrap can be removed.
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250616131740.3499289-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The hw_info uAPI will support a bidirectional data_type field that can be
used as an input for user space to request a specific info data type.
To prepare for the uAPI update, change the iommu layer first:
- Add a new IOMMU_HW_INFO_TYPE_DEFAULT as an input, for which a driver
can output its only (or first) supported type
- Update the kdoc accordingly
- Roll out the type validation in the existing drivers
Link: https://patch.msgid.link/r/00f4a2d3d930721f61367014717b3ba2d1e82a81.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replace u32 to make it clear. No functional changes.
Also simplify the kdoc since the type itself is clear enough.
Link: https://patch.msgid.link/r/651c50dee8ab900f691202ef0204cd5a43fdd6a2.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Commit 4f1492efb495 ("iommu/vt-d: Revert ATS timing change to fix boot
failure") placed the enabling of ATS in the probe_finalize callback. This
occurs after the default domain attachment, which is when the ATS cache
tag is assigned. Consequently, the device TLB cache tag is missed when the
domain is attached, leading to the device TLB not being invalidated in the
iommu_unmap paths.
Fix this by assigning the CACHE_TAG_DEVTLB cache tag when ATS is enabled.
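A sketch of the fix as described, reusing the driver's cache-tag helper
(the call site and condition are illustrative, not the patch's exact
code):
  /* ATS just got enabled in probe_finalize; assign the device TLB
   * cache tag so iommu_unmap paths invalidate the device TLB too. */
  if (info->ats_enabled)
          cache_tag_assign_domain(to_dmar_domain(domain), dev, IOMMU_NO_PASID);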
Fixes: 4f1492efb495 ("iommu/vt-d: Revert ATS timing change to fix boot failure")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250625050135.3129955-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250628100351.3198955-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Now that we have a concise helper to create an MSI parent domain,
switch the Intel IOMMU remapping over to that.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/all/20241204124549.607054-10-maz@kernel.org
Link: https://lore.kernel.org/r/169c793c50be8493cfd9d11affb00e9ed6341c36.1750858125.git.namcao@linutronix.de
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
These drivers all set the domain->pgsize_bitmap in their
domain_alloc_paging() functions, so the ops value is never used. Delete
it.
Reviewed-by: Sven Peter <sven@svenpeter.dev> # for Apple DART
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> # for RISC-V
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures. Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
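Illustrative shape of the vendor-specific handoff, in place of one
overloaded structure (the field layout here is an assumption, not the
patch's):
  struct intel_iommu_pi_data {
          u64 pi_desc_addr;       /* Intel posted-interrupt descriptor */
  };

  struct amd_iommu_pi_data {
          u32 ga_tag;             /* AMD's IOMMU already had its own struct */
          bool is_guest_mode;
  };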
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Merge branches 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
Commit 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
changed the context entry setup during domain attachment from a
set-and-check policy to a clear-and-reset approach. This inadvertently
introduced a regression affecting PCI aliased devices behind PCIe-to-PCI
bridges.
Specifically, keyboard and touchpad stopped working on several Apple
Macbooks with below messages:
kernel: platform pxa2xx-spi.3: Adding to iommu group 20
kernel: input: Apple SPI Keyboard as /devices/pci0000:00/0000:00:1e.3/pxa2xx-spi.3/spi_master/spi2/spi-APP000D:00/input/input0
kernel: DMAR: DRHD: handling fault status reg 3
kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr 0xffffa000 [fault reason 0x06] PTE Read access is not set
kernel: DMAR: DRHD: handling fault status reg 3
kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr 0xffffa000 [fault reason 0x06] PTE Read access is not set
kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
kernel: DMAR: DRHD: handling fault status reg 3
kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr 0xffffa000 [fault reason 0x06] PTE Read access is not set
kernel: DMAR: DRHD: handling fault status reg 3
kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
Fix this by restoring the previous context setup order.
Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
Closes: https://lore.kernel.org/all/4dada48a-c5dd-4c30-9c85-5b03b0aa01f0@bfh.ch/
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250514060523.2862195-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250520075849.755012-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
According to "Function return values and names" in coding-style.rst, the
dmar_ats_supported() function should return a boolean instead of an
integer. Also, rename "ret" to "supported" to be more straightforward.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-3-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The function dmar_find_matched_satc_unit() contains a duplicate call to
pci_physfn(). This call is unnecessary as pci_physfn() has already been
invoked by the caller. Removing the redundant call simplifies the code
and improves efficiency a bit.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-2-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The domain ID allocator is currently protected by a spin_lock. However,
ida_alloc_range can potentially block if it needs to allocate memory to
grow its internal structures.
Replace the spin_lock with a mutex, which allows sleeping. Thus, the
memory allocation flags can be updated from GFP_ATOMIC to GFP_KERNEL
to allow blocking memory allocations if necessary.
Introduce a new mutex, did_lock, specifically for protecting the domain
ida. The existing spinlock will remain for protecting other intel_iommu
fields.
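A sketch of the resulting allocation path (the lock and ida field names
follow the description above; the ID range bounds are illustrative):
  int id;

  mutex_lock(&iommu->did_lock);
  id = ida_alloc_range(&iommu->domain_ida, min_did, max_did, GFP_KERNEL);
  mutex_unlock(&iommu->did_lock);
  if (id < 0)
          return id;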
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Switch the intel iommu driver to use the ida mechanism for managing domain
IDs, replacing the previous fixed-size bitmap.
The previous approach allocated a bitmap large enough to cover the maximum
number of domain IDs supported by the hardware, regardless of the actual
number of domains in use. This led to unnecessary memory consumption,
especially on systems supporting a large number of iommu units but only
utilizing a small number of domain IDs.
The ida allocator dynamically manages the allocation and freeing of integer
IDs, only consuming memory for the IDs that are currently in use. This
significantly optimizes memory usage compared to the fixed-size bitmap.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
VT-D HW can do WO permissions on the second-stage but not the first-stage
page table formats. The commit eea53c581688 ("iommu/vt-d: Remove WO
permissions on second-level paging entries") wanted to make this uniform
for VT-D by disabling the support for WO permissions in the second-stage.
This isn't consistent with how other drivers work. Instead, if the
underlying HW can support WO, it should. For instance, AMD already supports
WO on its second stage (v1) format and not its first (v2).
If WO support needs to be discoverable it should be done through an
iommu_domain capability flag.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-c26553717e90+65f-iommu_vtd_ss_wo_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Nothing in there is active if CONFIG_INTEL_IOMMU is not enabled, so the whole
directory can depend on that switch as well.
Fixes: ab65ba57e3ac ("iommu/vt-d: Move Kconfig and Makefile bits down into intel directory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/3818749.MHq7AAxBmi@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
No external drivers use these interfaces anymore. Furthermore, no existing
iommu drivers implement anything in the callbacks. Remove them to avoid
dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Update iopf enablement in the driver to use the new method, similar to
the arm-smmu-v3 driver. Enable iopf support when any domain with an
iopf_handler is attached, and disable it when the domain is removed.
Place all the logic for controlling the PRI and iopf queue in the domain
set/remove/replace paths. Keep track of the number of domains set to the
device and PASIDs that require iopf. When the first domain requiring iopf
is attached, add the device to the iopf queue and enable PRI. When the
last domain is removed, remove it from the iopf queue and disable PRI.
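A sketch of the first/last accounting this describes (the refcount
field and helper name are assumptions; PRI enablement is elided):
  static int iopf_for_domain_set(struct iommu_domain *domain, struct device *dev)
  {
          struct device_domain_info *info = dev_iommu_priv_get(dev);

          if (!domain || !domain->iopf_handler)
                  return 0;

          /* Only the first iopf-capable domain adds the device to the
           * iopf queue and enables PRI; later attaches just count. */
          if (info->iopf_refcount++)
                  return 0;

          return iopf_queue_add_device(info->iommu->iopf_queue, dev);
  }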
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
None of the drivers implement anything here anymore, remove the dead code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
On the Lenovo ThinkPad X201, when Intel VT-d is enabled in the BIOS, the
kernel boots with errors related to DMAR, the graphical interface appeared
quite choppy, and the system resets erratically within a minute after it
booted:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xb97ff000 [fault reason 0x05] PTE Write access is not set
Upon comparing boot logs with VT-d on/off, I found that the Intel Calpella
quirk (`quirk_calpella_no_shadow_gtt()') correctly applied the igfx IOMMU
disable quirk:
pci 0000:00:00.0: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU for graphics
Whereas with VT-d on, it went into the "else" branch, which then
triggered the DMAR handling fault above:
... else if (!disable_igfx_iommu) {
/* we have to ensure the gfx device is idle before we flush */
pci_info(dev, "Disabling batched IOTLB flush on Ironlake\n");
iommu_set_dma_strict();
}
Now, this is not exactly scientific, but moving 0x0044 to quirk_iommu_igfx
seems to have fixed the aforementioned issue. Running a few `git blame'
runs on the function, I have found that the quirk was originally
introduced as a fix specific to ThinkPad X201:
commit 9eecabcb9a92 ("intel-iommu: Abort IOMMU setup for igfx if BIOS gave
no shadow GTT space")
Which was later revised twice to the "else" branch we saw above:
- 2011: commit 6fbcfb3e467a ("intel-iommu: Workaround IOTLB hang on
Ironlake GPU")
- 2024: commit ba00196ca41c ("iommu/vt-d: Decouple igfx_off from graphic
identity mapping")
I'm uncertain whether further testing on this particular laptop was
done in 2011 and (honestly, I'm not sure) 2024, but I would be happy to do
some distro-specific testing if that's what would be required to verify
this patch.
P.S., I also see IDs 0x0040, 0x0062, and 0x006a listed under the same
`quirk_calpella_no_shadow_gtt()' quirk, but I'm not sure how similar these
chipsets are (if they share the same issue with VT-d or even, indeed, if
this issue is specific to a bug in the Lenovo BIOS). With regards to
0x0062, it seems to be a Centrino wireless card, but not a chipset?
I have also listed a couple (distro and kernel) bug reports below as
references (some of them are from 7-8 years ago!), as they seem to be
similar issue found on different Westmere/Ironlake, Haswell, and Broadwell
hardware setups.
Cc: stable@vger.kernel.org
Fixes: 6fbcfb3e467a ("intel-iommu: Workaround IOTLB hang on Ironlake GPU")
Fixes: ba00196ca41c ("iommu/vt-d: Decouple igfx_off from graphic identity mapping")
Link: https://groups.google.com/g/qubes-users/c/4NP4goUds2c?pli=1
Link: https://bugs.archlinux.org/task/65362
Link: https://bbs.archlinux.org/viewtopic.php?id=230323
Reported-by: Wenhao Sun <weiguangtwk@outlook.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=197029
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Link: https://lore.kernel.org/r/20250415133330.12528-1-jeffbai@aosc.io
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Commit 5518f239aff1 ("iommu/vt-d: Move scalable mode ATS enablement to
probe path") changed the PCI ATS enablement logic to run earlier,
specifically before the default domain attachment.
On some client platforms, this change resulted in boot failures, causing
the kernel to panic with the following message and call trace:
Kernel panic - not syncing: DMAR hardware is malfunctioning
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #175
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
dump_stack+0x10/0x16
panic+0x10a/0x2b7
iommu_enable_translation.cold+0xc/0xc
intel_iommu_init+0xe39/0xec0
? trace_hardirqs_on+0x1e/0xd0
? __pfx_pci_iommu_init+0x10/0x10
pci_iommu_init+0xd/0x40
do_one_initcall+0x5b/0x390
kernel_init_freeable+0x26d/0x2b0
? __pfx_kernel_init+0x10/0x10
kernel_init+0x15/0x120
ret_from_fork+0x35/0x60
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1a/0x30
RIP: 1f0f:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0000:0000000000000000 EFLAGS: 841f0f2e66 ORIG_RAX: 1f0f2e6600000000
RAX: 0000000000000000 RBX: 1f0f2e6600000000 RCX: 2e66000000000084
RDX: 0000000000841f0f RSI: 000000841f0f2e66 RDI: 00841f0f2e660000
RBP: 00841f0f2e660000 R08: 00841f0f2e660000 R09: 000000841f0f2e66
R10: 0000000000841f0f R11: 2e66000000000084 R12: 000000841f0f2e66
R13: 0000000000841f0f R14: 2e66000000000084 R15: 1f0f2e6600000000
</TASK>
---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---
Fix this by reverting the timing change for ATS enablement introduced by
the offending commit and restoring the previous behavior.
Fixes: 5518f239aff1 ("iommu/vt-d: Move scalable mode ATS enablement to probe path")
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Closes: https://lore.kernel.org/linux-iommu/01b9c72f-460d-4f77-b696-54c6825babc9@linux.intel.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250416073608.1799578-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Intel is the only thing that uses this now, convert to the size versions,
trying to avoid PAGE_SHIFT.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/23-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Use iommu_alloc_pages_node_sz() instead.
AMD and Intel are both using 4K pages for these structures since those
drivers only work on 4K PAGE_SIZE.
riscv is also spec'd to use SZ_4K.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Convert most of the places calling get_order() as an argument to the
iommu-pages allocator into order_base_2() or the _sz flavour
instead. These places already have an exact size, there is no particular
reason to use order here.
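An example of the conversion pattern with an illustrative size (the
"before" form is the older order-based helper; iommu_alloc_pages_node_sz
is the size-based flavour named in this series):
  /* before: round an exact size up through a page order */
  ptr = iommu_alloc_pages_node(nid, GFP_KERNEL, get_order(size));

  /* after: pass the exact size straight through */
  ptr = iommu_alloc_pages_node_sz(nid, GFP_KERNEL, size);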
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
This converts the remaining places using list of pages to the new API.
The Intel free path was shared with its gather path, so it is converted at
the same time.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Use iommu_free_pages() instead.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Now that we have a folio under the allocation, iommu_free_pages() can know
the order of the original allocation and do the correct thing to free it.
The next patch will rename iommu_free_page() to iommu_free_pages() so we
have naming consistency with iommu_alloc_pages_node().
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Do not touch per-device DMA ops when the driver has been converted to use
the dma-iommu API.
Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://lore.kernel.org/r/20250403165605.278541-1-ptesarik@suse.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Set the posted MSI irq_chip's irq_ack() hook to irq_move_irq() instead of
a dummy/empty callback so that posted MSIs process pending changes to the
IRQ's SMP affinity. Failure to honor a pending set-affinity results in
userspace being unable to change the effective affinity of the IRQ, as
IRQD_SETAFFINITY_PENDING is never cleared and so irq_set_affinity_locked()
always defers moving the IRQ.
The issue is most easily reproducible by setting /proc/irq/xx/smp_affinity
multiple times in quick succession, as only the first update is likely to
be handled in process context.
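A sketch of the change on the posted-MSI irq_chip (only the relevant
member is shown; the rest of the chip is elided):
  static struct irq_chip intel_ir_chip_post_msi = {
          .name           = "INTEL-IR-POST",
          /* Was a dummy ack; irq_move_irq() processes a pending
           * set-affinity so IRQD_SETAFFINITY_PENDING gets cleared. */
          .irq_ack        = irq_move_irq,
  };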
Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: Robert Lippert <rlippert@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Wentao Yang <wentaoyang@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250321194249.1217961-1-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with faults, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
|
|
The Intel iommu driver just treats this as a nop since Intel VT-d has no
special requirement on domains attached to either the PASID or the RID of
a PASID-capable device.
Link: https://patch.msgid.link/r/20250321171940.7213-14-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Merge branches 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next
|
|
We have recently seen reports of lockdep circular lock dependency warnings
on platforms like Skylake and Kabylake:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
------------------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
at: iommu_probe_device+0x1d/0x70
but task is already holding lock:
ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
at: intel_iommu_init+0xe75/0x11f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&device->physical_node_lock){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
intel_iommu_init+0xe75/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #5 (dmar_global_lock){++++}-{3:3}:
down_read+0x43/0x1d0
enable_drhd_fault_handling+0x21/0x110
cpuhp_invoke_callback+0x4c6/0x870
cpuhp_issue_call+0xbf/0x1f0
__cpuhp_setup_state_cpuslocked+0x111/0x320
__cpuhp_setup_state+0xb0/0x220
irq_remap_enable_fault_handling+0x3f/0xa0
apic_intr_mode_init+0x5c/0x110
x86_late_time_init+0x24/0x40
start_kernel+0x895/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__cpuhp_setup_state_cpuslocked+0x67/0x320
__cpuhp_setup_state+0xb0/0x220
page_alloc_init_cpuhp+0x2d/0x60
mm_core_init+0x18/0x2c0
start_kernel+0x576/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
__cpuhp_state_add_instance+0x4f/0x220
iova_domain_init_rcaches+0x214/0x280
iommu_setup_dma_ops+0x1a4/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_setup_dma_ops+0x16b/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&group->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__iommu_probe_device+0x24c/0x4e0
probe_iommu_group+0x2b/0x50
bus_for_each_dev+0x7d/0xe0
iommu_device_register+0xe1/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
__lock_acquire+0x1637/0x2810
lock_acquire+0xc9/0x300
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_probe_device+0x1d/0x70
intel_iommu_init+0xe90/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Chain exists of:
iommu_probe_device_lock --> dmar_global_lock -->
&device->physical_node_lock
Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&device->physical_node_lock);
                               lock(dmar_global_lock);
                               lock(&device->physical_node_lock);
  lock(iommu_probe_device_lock);

 *** DEADLOCK ***
This driver uses a global lock to protect the list of enumerated DMA
remapping units. It is necessary due to the driver's support for dynamic
addition and removal of remapping units at runtime.
Two distinct code paths require iteration over this remapping unit list:
- Device registration and probing: the driver iterates the list to
register each remapping unit with the upper layer IOMMU framework
and subsequently probe the devices managed by that unit.
- Global configuration: Upper layer components may also iterate the list
to apply configuration changes.
The lock acquisition order between these two code paths was reversed. This
caused lockdep warnings, indicating a risk of deadlock. Fix this warning
by releasing the global lock before invoking upper layer interfaces for
device registration.
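A sketch of the reordering in intel_iommu_init() (the registration call
is the existing core API; the exact surrounding code is elided):
  up_read(&dmar_global_lock);
  ret = iommu_device_register(&iommu->iommu, &intel_iommu_ops, NULL);
  down_read(&dmar_global_lock);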
Fixes: b150654f74bf ("iommu/vt-d: Fix suspicious RCU usage")
Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@SJ1PR11MB6129.namprd11.prod.outlook.com/
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250317035714.1041549-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Don't overwrite an IRTE that is posting IRQs to a vCPU with a posted MSI
entry if the host IRQ affinity happens to change. If/when the IRTE is
reverted back to "host mode", it will be reconfigured as a posted MSI or
remapped entry as appropriate.
Drop the "mode" field, which doesn't differentiate between posted MSIs and
posted vCPUs, in favor of a dedicated posted_vcpu flag. Note! The two
posted_{msi,vcpu} flags are intentionally not mutually exclusive; an IRTE
can transition between posted MSI and posted vCPU.
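A sketch of the flag split (the storage shown here is illustrative):
  struct irq_2_iommu {
          /* ... existing IRTE bookkeeping ... */
          bool posted_msi;        /* IRTE posts to the posted-MSI PIR */
          bool posted_vcpu;       /* IRTE posts directly to a vCPU */
  };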
Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-3-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|