linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2025-06-23	Merge tag 'f2fs-for-6.16-rc4' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: - fix double-unlock introduced by the recent folio conversion - fix stale page content beyond EOF complained by xfstests/generic/363 * tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: fix to zero post-eof page f2fs: Fix __write_node_folio() conversion
2025-06-23	Revert "PCI/ACPI: Fix allocated memory release on error in pci_acpi_scan_root()"	Zhe Qiao
	This reverts commit 631b2af2f357 ("PCI/ACPI: Fix allocated memory release on error in pci_acpi_scan_root()"). The reverted patch causes the 'ri->cfg' and 'root_ops' resources to be released multiple times. When acpi_pci_root_create() fails, these resources have already been released internally by the __acpi_pci_root_release_info() function. Releasing them again in pci_acpi_scan_root() leads to incorrect behavior and potential memory issues. We plan to resolve the issue using a more appropriate fix. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/aEmdnuw715btq7Q5@stanley.mountain/ Signed-off-by: Zhe Qiao <qiaozhe@iscas.ac.cn> Acked-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/20250619072608.2075475-1-qiaozhe@iscas.ac.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-23	net: netpoll: Initialize UDP checksum field before checksumming	Breno Leitao
	commit f1fce08e63fe ("netpoll: Eliminate redundant assignment") removed the initialization of the UDP checksum, which was wrong and broke netpoll IPv6 transmission due to bad checksumming. udph->check needs to be set before calling csum_ipv6_magic(). Fixes: f1fce08e63fe ("netpoll: Eliminate redundant assignment") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250620-netpoll_fix-v1-1-f9f0b82bc059@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23	Merge tag 'for-6.16-rc3-tag' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-23	libbpf: Fix null pointer dereference in btf_dump__free on allocation failure	Yuan Chen
	When btf_dump__new() fails to allocate memory for the internal hashmap (btf_dump->type_names), it returns an error code. However, the cleanup function btf_dump__free() does not check if btf_dump->type_names is NULL before attempting to free it. This leads to a null pointer dereference when btf_dump__free() is called on a btf_dump object. Fixes: 351131b51c7a ("libbpf: add btf_dump API for BTF-to-C conversion") Signed-off-by: Yuan Chen <chenyuan@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250618011933.11423-1-chenyuan_fl@163.com
2025-06-23	attach_recursive_mnt(): do not lock the covering tree when sliding something ↵	Al Viro
	under it If we are propagating across the userns boundary, we need to lock the mounts added there. However, in case when something has already been mounted there and we end up sliding a new tree under that, the stuff that had been there before should not get locked. IOW, lock_mnt_tree() should be called before we reparent the preexisting tree on top of what we are adding. Fixes: 3bd045cc9c4b ("separate copying and locking mount tree on cross-userns copies") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23	replace collect_mounts()/drop_collected_mounts() with a safer variant	Al Viro
	collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving filesystem not mounted anywhere and yet still busy). A saner approach is to give caller an array of struct path that would pin every mount in a subtree, without cloning any mounts. * collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone * collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount,root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are not cloned - we just get pinning references to the roots of subtrees in the caller's namespace. Array is terminated by {NULL, NULL} struct path. If it fits into preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL). * drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and array is kfree'd if it's not equal to 'preallocated'. * instead of iterate_mounts(), users should just iterate over array of struct path - nothing exotic is needed for that. Existing users (all in audit_tree.c) are converted. [folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>] Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry") Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23	PCI/PTM: Build debugfs code only if CONFIG_DEBUG_FS is enabled	Manivannan Sadhasivam
	Otherwise, the following build error will happen for CONFIG_DEBUG_FS=n && CONFIG_PCIE_PTM=y: drivers/pci/pcie/ptm.c:498:25: error: redefinition of 'pcie_ptm_create_debugfs' 498 \| struct pci_ptm_debugfs pcie_ptm_create_debugfs(struct device dev, void pdata, \| ^ ./include/linux/pci.h:1915:2: note: previous definition is here 1915 \| pcie_ptm_create_debugfs(struct device dev, void pdata, \| ^ drivers/pci/pcie/ptm.c:546:6: error: redefinition of 'pcie_ptm_destroy_debugfs' 546 \| void pcie_ptm_destroy_debugfs(struct pci_ptm_debugfs ptm_debugfs) \| ^ ./include/linux/pci.h:1918:1: note: previous definition is here 1918 \| pcie_ptm_destroy_debugfs(struct pci_ptm_debugfs ptm_debugfs) { } \| Fixes: 132833405e61 ("PCI: Add debugfs support for exposing PTM context") Reported-by: Eric Biggers <ebiggers@kernel.org> Closes: https://lore.kernel.org/linux-pci/20250607025506.GA16607@sol Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250608033305.15214-1-manivannan.sadhasivam@linaro.org
2025-06-23	scsi: qla4xxx: Fix missing DMA mapping error in qla4xxx_alloc_pdu()	Thomas Fourier
	dma_map_XXX() can fail and should be tested for errors with dma_mapping_error(). Fixes: b3a271a94d00 ("[SCSI] qla4xxx: support iscsiadm session mgmt") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Link: https://lore.kernel.org/r/20250618071742.21822-2-fourier.thomas@gmail.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-06-23	scsi: qla2xxx: Fix DMA mapping test in qla24xx_get_port_database()	Thomas Fourier
	dma_map_XXX() functions return as error values DMA_MAPPING_ERROR which is often ~0. The error value should be tested with dma_mapping_error() like it was done in qla26xx_dport_diagnostics(). Fixes: 818c7f87a177 ("scsi: qla2xxx: Add changes in preparation for vendor extended FDMI/RDP") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Link: https://lore.kernel.org/r/20250617161115.39888-2-fourier.thomas@gmail.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-06-23	KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements	Sean Christopherson
	Add a selftest to verify that eventfd+irqfd bindings are globally unique, i.e. that KVM doesn't allow multiple irqfds to bind to a single eventfd, even across VMs. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD	Sean Christopherson
	Add helpers to create eventfds and to (de)assign eventfds via KVM_IRQFD. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test	Sean Christopherson
	Assert that eventfd() succeeds in the Xen shinfo test instead of skipping the associated testcase. While eventfd() is outside the scope of KVM, KVM unconditionally selects EVENTFD, i.e. the syscall should always succeed. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Drop sanity check that per-VM list of irqfds is unique	Sean Christopherson
	Now that the eventfd's waitqueue ensures it has at most one priority waiter, i.e. prevents KVM from binding multiple irqfds to one eventfd, drop KVM's sanity check that eventfds are unique for a single VM. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter	Sean Christopherson
	Disallow binding an irqfd to an eventfd that already has a priority waiter, i.e. to an eventfd that already has an attached irqfd. KVM always operates in exclusive mode for EPOLL_IN (unconditionally returns '1'), i.e. only the first waiter will be notified. KVM already disallows binding multiple irqfds to an eventfd in a single VM, but doesn't guard against multiple VMs binding to an eventfd. Adding the extra protection reduces the pain of a userspace VMM bug, e.g. if userspace fails to de-assign before re-assigning when transferring state for intra-host migration, then the migration will explicitly fail as opposed to dropping IRQs on the destination VM. Temporarily keep KVM's manual check on irqfds.items, but add a WARN, e.g. to allow sanity checking the waitqueue enforcement. Cc: Oliver Upton <oliver.upton@linux.dev> Cc: David Matlack <dmatlack@google.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	sched/wait: Add a waitqueue helper for fully exclusive priority waiters	Sean Christopherson
	Add a waitqueue helper to add a priority waiter that requires exclusive wakeups, i.e. that requires that it be the _only_ priority waiter. The API will be used by KVM to ensure that at most one of KVM's irqfds is bound to a single eventfd (across the entire kernel). Open code the helper instead of using __add_wait_queue() so that the common path doesn't need to "handle" impossible failures. Cc: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE	Sean Christopherson
	Don't set WQ_FLAG_EXCLUSIVE when adding an irqfd to a wait queue, as irqfd_wakeup() unconditionally returns '0', i.e. doesn't actually operate in exclusive mode. Note, the use of WQ_FLAG_PRIORITY is also dubious, but that's a problem for another day. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()	Sean Christopherson
	Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and instead have callers manually add the flag prior to adding their structure to the queue. Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature of exclusive, priority waiters means that only the first waiter added will ever receive notifications. Pushing the flawed behavior to callers will allow fixing the problem one hypervisor at a time (KVM added the flawed API, and then KVM's code was copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for adding an API that provides true exclusivity, i.e. that guarantees at most one priority waiter is in the queue. Opportunistically add a comment in Hyper-V to call out the mess. Xen privcmd's irqfd_wakefup() doesn't actually operate in exclusive mode, i.e. can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE. And KVM is primed to switch to the aforementioned fully exclusive API, i.e. won't be carrying the flawed code for long. No functional change intended. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock	Sean Christopherson
	Add an irqfd to its target eventfd's waitqueue while holding irqfds.lock, which is mildly terrifying but functionally safe. irqfds.lock is taken inside the waitqueue's lock, but if and only if the eventfd is being released, i.e. that path is mutually exclusive with registration as KVM holds a reference to the eventfd (and obviously must do so to avoid UAF). This will allow using the eventfd's waitqueue to enforce KVM's requirement that eventfd is assigned to at most one irqfd, without introducing races. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Add irqfd to KVM's list via the vfs_poll() callback	Sean Christopherson
	Add the irqfd structure to KVM's list of irqfds in kvm_irqfd_register(), i.e. via the vfs_poll() callback. This will allow taking irqfds.lock across the entire registration sequence (add to waitqueue, add to list), and more importantly will allow inserting into KVM's list if and only if adding to the waitqueue succeeds (spoiler alert), without needing to juggle return codes in weird ways. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Initialize irqfd waitqueue callback when adding to the queue	Sean Christopherson
	Initialize the irqfd waitqueue callback immediately prior to inserting the irqfd into the eventfd's waitqueue. Pre-initializing the state in a completely different context is all kinds of confusing, and incorrectly suggests that the waitqueue function needs to be initialize prior to vfs_poll(). Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Acquire SCRU lock outside of irqfds.lock during assignment	Sean Christopherson
	Acquire SRCU outside of irqfds.lock so that the locking is symmetrical, and add a comment explaining why on earth KVM holds SRCU for so long. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Use a local struct to do the initial vfs_poll() on an irqfd	Sean Christopherson
	Use a function-local struct for the poll_table passed to vfs_poll(), as nothing in the vfs_poll() callchain grabs a long-term reference to the structure, i.e. its lifetime doesn't need to be tied to the irqfd. Using a local structure will also allow propagating failures out of the polling callback without further polluting kvm_kernel_irqfd. Opportunstically rename irqfd_ptable_queue_proc() to kvm_irqfd_register() to capture what it actually does. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()	Sean Christopherson
	Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq() with kvm_set_msi(). Opportunistically delete the public declaration and export, as they are no longer used/needed. No functional change intended. Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking	Sean Christopherson
	Configure IRTEs to GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()). Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware. Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR. Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option. Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts	Sean Christopherson
	Add plumbing to the AMD IOMMU driver to allow KVM to control whether or not an IRTE is configured to generate GA log interrupts. KVM only needs a notification if the target vCPU is blocking, so the vCPU can be awakened. If a vCPU is preempted or exits to userspace, KVM clears is_run, but will set the vCPU back to running when userspace does KVM_RUN and/or the vCPU task is scheduled back in, i.e. KVM doesn't need a notification. Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes from the KVM changes insofar as possible. Opportunistically swap the ordering of parameters for amd_iommu_update_ga() so that the match amd_iommu_activate_guest_mode(). Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as a non-cached field, but per AMD hardware architects, it's not cached and can be safely updated without an invalidation. Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Consolidate IRTE update when toggling AVIC on/off	Sean Christopherson
	Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into __avic_vcpu_{load,put}(), and add a param to the helpers to communicate whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update, or just a quick update to set the CPU and IsRun. Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off	Sean Christopherson
	Don't query a vCPU's blocking status when toggling AVIC on/off; barring KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the correct state is desirable, i.e. well worth any overhead in a buggy scenario. Isolating the "real" load/put flows will allow moving the IOMMU IRTE (de)activation logic from avic_refresh_apicv_exec_ctrl() to avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's physical ID entry and its IRTEs in a common path, under a single critical section of ir_list_lock. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller	Sean Christopherson
	Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in anticipation of moving the __avic_vcpu_{load,put}() calls into the critical section, and because having a one-off helper with a name that's easily confused with avic_pi_update_irte() is unnecessary. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support	Sean Christopherson
	WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully supported, as KVM should guard all such calls on IRQ posting being enabled. Link: https://lore.kernel.org/r/20250611224604.313496-58-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata	Sean Christopherson
	Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to find and kick vCPUs when a device posted interrupt serves as a wake event. Lookups on a vCPU index are O(fast) (not sure what xa_load() actually provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match its index. Unlike the Physical APIC Table, which is accessed by hardware when virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_ use APIC IDs to fill the Physical APIC Table, but KVM has free rein over the format/meaning of the GA tag. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass	Sean Christopherson
	WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are disabled, to make it obvious that the logic is a sanity check, and so that a bug related to nr_possible_bypass_irqs is more like to cause noisy failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs. Link: https://lore.kernel.org/r/20250611224604.313496-56-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Decouple device assignment from IRQ bypass	Sean Christopherson
	Use a dedicated counter to track the number of IRQs that can utilize IRQ bypass instead of piggybacking the assigned device count. As evidenced by commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), it's possible for a device to be able to post IRQs to a vCPU without said device being assigned to a VM. Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment to avoid regressing the MMIO stale data mitigation. KVM is abusing the assigned device count when applying mmio_stale_data_clear, and it's not at all clear if vDPA devices rely on this behavior. This will hopefully be cleaned up in the future, as the number of assigned devices is a terrible heuristic for detecting if a VM has access to host MMIO. Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if ir_list is non-empty at vCPU free	Sean Christopherson
	Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC	Sean Christopherson
	Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC, as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any and all postable IRQs to an APIC-less VM. Link: https://lore.kernel.org/r/20250611224604.313496-53-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()	Sean Christopherson
	WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the code is only reachable if the VM already has an IRQ bypass producer (see kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(), which, stating the obvious, are called if and only if KVM enables its IRQ bypass hooks. Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250611224604.313496-52-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()	Sean Christopherson
	Don't bother checking if the VM has an assigned device when updating IRTE entries. kvm_arch_irq_bypass_add_producer() explicitly increments the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly decrements the count before invoking kvm_pi_update_irte(), and kvm_irq_routing_update() only updates IRTE entries if there's an active IRQ bypass producer. Link: https://lore.kernel.org/r/20250611224604.313496-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails	Sean Christopherson
	WARN if updating GA information for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata, and because returning an error that is always ignored is pointless. Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Process all IRTEs on affinity change even if one update fails	Sean Christopherson
	When updating IRTE GA fields, keep processing all other IRTEs if an update fails, as not updating later entries risks making a bad situation worse. Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if (de)activating guest mode in IOMMU fails	Sean Christopherson
	WARN if (de)activating "guest mode" for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata. Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check for assigned device(s) when activating AVIC	Sean Christopherson
	Don't short-circuit IRTE updating when (de)activating AVIC based on the VM having assigned devices, as nothing prevents AVIC (de)activation from racing with device (de)assignment. And from a performance perspective, bailing early when there is no assigned device doesn't add much, as ir_list_lock will never be contended if there's no assigned device. Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check for assigned device(s) when updating affinity	Sean Christopherson
	Don't bother checking if a VM has an assigned device when updating AVIC vCPU affinity, querying ir_list is just as cheap and nothing prevents racing with changes in device assignment. Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is ↵	Sean Christopherson
	inhibited If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave the actual IRTE in remapped mode. KVM already handles the case where AVIC is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()), but doesn't handle the scenario where a postable IRQ comes along while AVIC is inhibited. Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity	Sean Christopherson
	Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: Factor out helper for manipulating IRTE GA/CPU info	Sean Christopherson
	Split the guts of amd_iommu_update_ga() to a dedicated helper so that the logic can be shared with flows that put the IRTE into posted mode. Opportunistically move amd_iommu_update_ga() and its new helper above amd_iommu_activate_guest_mode() so that it's all co-located. Link: https://lore.kernel.org/r/20250611224604.313496-43-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination	Sean Christopherson
	Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregrious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify	Sean Christopherson
	Add a comment to amd_iommu_update_ga() to document what fields it can safely modify without issuing an invalidation of the IRTE, and to explain its role in keeping GA IRTEs up-to-date. Per page 93 of the IOMMU spec dated Feb 2025: When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag], are not cached by the IOMMU. Modifications to these fields do not require an invalidation of the Interrupt Remapping Table. Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU	Sean Christopherson
	Now that svm_ir_list_add() isn't overloaded with all manner of weird things, fold it into avic_pi_update_irte(), and more importantly take ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info that's shoved into the IRTE is fresh. While preemption (and IRQs) is disabled on the task performing the IRTE update, thanks to irqfds.lock, that task doesn't hold the vCPU's mutex, i.e. preemption being disabled is irrelevant. Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata	Sean Christopherson
	Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up and doesn't provide the necessary metadata. Returning an error up the stack without actually handling the error is useless and confusing. Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Don't update IRTE entries when old and new routes were !MSI	Sean Christopherson
	Skip the entirety of IRTE updates on a GSI routing change if neither the old nor the new routing is for an MSI, i.e. if the neither routing setup allows for posting to a vCPU. If the IRTE isn't already host controlled, KVM has bigger problems. Link: https://lore.kernel.org/r/20250611224604.313496-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>