path: root/include/linux
Age  Commit message  Author
2025-06-23  iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination  (Sean Christopherson)
Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()  (Sean Christopherson)
Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing(). Calling arch code to know whether or not to call arch code is absurd. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Don't WARN if updating IRQ bypass route fails  (Sean Christopherson)
Don't bother WARNing if updating an IRTE route fails now that vendor code provides much more precise WARNs. The generic WARN doesn't provide enough information to actually debug the problem, and has obviously done nothing to surface the myriad bugs in KVM x86's implementation. Drop all of the associated return code plumbing that existed just so that common KVM could WARN. Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs  (Sean Christopherson)
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU drivers into vendor specific structures. Overloading a single structure makes the code hard to read and maintain, is *very* misleading as it suggests that mixing vendors is actually supported, and bastardizing Intel's posted interrupt descriptor address when AMD's IOMMU already has its own structure is quite unnecessary. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr  (Sean Christopherson)
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and KVM's one and only path for writing amd_iommu_pi_data.base computes the exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base, and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is valid, i.e. amd_iommu_pi_data.base is fully redundant. Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  io_uring: add struct io_cold_def->sqe_copy() method  (Jens Axboe)
Will be called by the core of io_uring, if inline issue is not going to be tried for a request. Opcodes can define this handler to defer copying of SQE data that should remain stable. Only called if IO_URING_F_INLINE is set. If it isn't set, then there's a bug in the core handling of this, and -EFAULT will be returned instead to terminate the request. This will trigger a WARN_ON_ONCE(). Don't expect this to ever trigger, and down the line this can be removed. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
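As a rough illustration of the flow described above, here is a hedged sketch of how the core side might gate the new hook on IO_URING_F_INLINE. The helper name and the assumed ->sqe_copy(req) signature are illustrative, not the exact io_uring implementation:

  /* Hedged sketch only: the io_cold_defs[] lookup and the issue-flag check
   * mirror the description above; exact plumbing is assumed. */
  static int io_maybe_copy_sqe(struct io_kiocb *req, unsigned int issue_flags)
  {
          const struct io_cold_def *def = &io_cold_defs[req->opcode];

          if (!def->sqe_copy)
                  return 0;
          /* Deferral without the inline window would be a core bug: warn and fail. */
          if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
                  return -EFAULT;
          def->sqe_copy(req);
          return 0;
  }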
2025-06-23  io_uring: add IO_URING_F_INLINE issue flag  (Jens Axboe)
Set when the execution of the request is done inline from the system call itself. Any deferred issue will never have this flag set. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23  futex: Initialize futex_phash_new during fork().  (Sebastian Andrzej Siewior)
During a hash resize operation the new private hash is stored in mm_struct::futex_phash_new if the current hash can not be immediately replaced. The new hash must not be copied during fork() into the new task. Doing so will lead to a double-free of the memory by the two tasks. Initialize the mm_struct::futex_phash_new during fork(). Closes: https://lore.kernel.org/all/aFBQ8CBKmRzEqIfS@mozart.vkv.me/ Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash") Reported-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Calvin Owens <calvin@wbinvd.org> Link: https://lkml.kernel.org/r/20250623083408.jTiJiC6_@linutronix.de
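A minimal sketch of the fix, assuming the reset happens in a fork-time mm initialization helper; the helper and field names are taken loosely from the private-hash series and may not match the tree exactly:

  /* Hedged sketch: never let a child inherit a pending hash resize. */
  static inline void futex_mm_init(struct mm_struct *mm)
  {
          RCU_INIT_POINTER(mm->futex_phash, NULL);
          mm->futex_phash_new = NULL;     /* the fix: explicitly cleared on fork() */
  }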
2025-06-23  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate  (Zhang Yi)
With the development of flash-based storage devices, we can quickly write zeros to SSDs using the WRITE_ZERO command if the devices do not actually write physical zeroes to the media. Therefore, we can use this command to quickly preallocate a real all-zero file with written extents. This approach should be beneficial for subsequent pure overwriting within this file, as it can save on block allocation and, consequently, significant metadata changes, which should greatly improve overwrite performance on certain filesystems. Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to fallocate. This flag is used to convert a specified range of a file to zeros by issuing a zeroing operation. Blocks should be allocated for the regions that span holes in the file, and the entire range is converted to written extents. If the underlying device supports the actual offload write zeroes command, the zeroing operation can be accelerated. If it does not, we currently don't prevent the file system from writing actual zeros to the device. This provides users with a new method to quickly generate a zeroed file; users no longer need to write zero data to create a file with written extents. Users can determine whether a disk supports the unmap write zeroes feature by querying this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes. Users can also enable or disable the unmap write zeroes operation through this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes. Finally, this flag cannot be specified in conjunction with FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is not permitted. In addition, filesystems that always require out-of-place writes should not support this flag since they still need to allocate new blocks during subsequent overwrites. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-7-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
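For illustration, a hedged userspace example of preallocating a written, zeroed file with the new flag; it assumes uapi headers that already define FALLOC_FL_WRITE_ZEROES:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/falloc.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("zeroed.img", O_CREAT | O_RDWR | O_TRUNC, 0644);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* Allocate 1 GiB of written, zeroed extents in a single call. */
          if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, 1ULL << 30) < 0)
                  perror("fallocate(FALLOC_FL_WRITE_ZEROES)");
          close(fd);
          return 0;
  }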
2025-06-23  block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits  (Zhang Yi)
Currently, disks primarily implement the write zeroes command (aka REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves physically writing zeros to the disk media (e.g., HDDs), while the second performs an unmap operation on the logical blocks, effectively putting them into a deallocated state (e.g., SSDs). The first method is generally slow, while the second method is typically very fast. For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate the write zeros operation by placing disk blocks into a deallocated state, which opportunistically avoids writing zeroes to media while still guaranteeing that subsequent reads from the specified block range will return zeroed data. This is a best-effort optimization, not a mandatory requirement; some devices may partially fall back to writing physical zeroes due to factors such as misalignment or being asked to clear a block range smaller than the device's internal allocation unit. Therefore, the speed of this operation is not guaranteed. It is difficult to determine whether the storage device supports the unmap write zeroes operation. We cannot determine this by only querying bdev_limits(bdev)->max_write_zeroes_sectors. Therefore, first, add a new hardware queue limit parameter, max_hw_wzeroes_unmap_sectors, to indicate whether a device supports this unmap write zeroes operation. Then, add two new counterpart software queue limits, max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which allow users to disable this operation if the speed is very slow on some special devices. Finally, for the stacked device case, initialize these two parameters to UINT_MAX. This operation should be enabled by both the stacking driver and all underlying devices. Thanks to Martin K. Petersen for optimizing the documentation of the write_zeroes_unmap sysfs interface. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
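A minimal sketch of how a driver might advertise the new hardware limit, assuming the queue_limits field name from the commit message; stacking drivers would instead start from UINT_MAX as described above:

  /* Hedged sketch: advertise unmap-capable write zeroes only when the device
   * deallocates blocks instead of writing physical zeroes to the media. */
  static void example_set_wzeroes_limits(struct queue_limits *lim,
                                         unsigned int max_sectors,
                                         bool unmap_capable)
  {
          lim->max_write_zeroes_sectors = max_sectors;
          lim->max_hw_wzeroes_unmap_sectors = unmap_capable ? max_sectors : 0;
  }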
2025-06-23  fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass  (Shivank Garg)
Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create anonymous inodes with a proper security context. This replaces the current pattern of calling alloc_anon_inode() followed by inode_init_security_anon() for creating the security context manually. This change also fixes a security regression in secretmem where the S_PRIVATE flag was not cleared after alloc_anon_inode(), causing LSM/SELinux checks to be bypassed for secretmem file descriptors. As guest_memfd currently resides in the KVM module, we need to export this symbol for use outside the core kernel. In the future, guest_memfd might be moved to core-mm, at which point the symbol would no longer have to be exported. When/if that happens is still unclear. Fixes: 2bfe15c52612 ("mm: create security context for memfd_secret inodes") Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23  docs/vfs: update references to i_mutex to i_rwsem  (Junxuan Liao)
The VFS switched to i_rwsem ten years ago (9902af79c01a: "parallel lookups: actual switch to rwsem"), but the VFS documentation and comments still have references to i_mutex. Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu> Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23  crypto: hisilicon - Use fine grained DMA mapping direction  (Zenghui Yu)
The following splat was triggered when booting the kernel built with arm64's defconfig + CRYPTO_SELFTESTS + DMA_API_DEBUG.

  ------------[ cut here ]------------
  DMA-API: hisi_sec2 0000:75:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported
  WARNING: CPU: 24 PID: 1273 at kernel/dma/debug.c:596 add_dma_entry+0x248/0x308
  Call trace:
   add_dma_entry+0x248/0x308 (P)
   debug_dma_map_sg+0x208/0x3e4
   __dma_map_sg_attrs+0xbc/0x118
   dma_map_sg_attrs+0x10/0x24
   hisi_acc_sg_buf_map_to_hw_sgl+0x80/0x218 [hisi_qm]
   sec_cipher_map+0xc4/0x338 [hisi_sec2]
   sec_aead_sgl_map+0x18/0x24 [hisi_sec2]
   sec_process+0xb8/0x36c [hisi_sec2]
   sec_aead_crypto+0xe4/0x264 [hisi_sec2]
   sec_aead_encrypt+0x14/0x20 [hisi_sec2]
   crypto_aead_encrypt+0x24/0x38
   test_aead_vec_cfg+0x480/0x7e4
   test_aead_vec+0x84/0x1b8
   alg_test_aead+0xc0/0x498
   alg_test.part.0+0x518/0x524
   alg_test+0x20/0x64
   cryptomgr_test+0x24/0x44
   kthread+0x130/0x1fc
   ret_from_fork+0x10/0x20
  ---[ end trace 0000000000000000 ]---
  DMA-API: Mapped at:
   debug_dma_map_sg+0x234/0x3e4
   __dma_map_sg_attrs+0xbc/0x118
   dma_map_sg_attrs+0x10/0x24
   hisi_acc_sg_buf_map_to_hw_sgl+0x80/0x218 [hisi_qm]
   sec_cipher_map+0xc4/0x338 [hisi_sec2]

This occurs in selftests where the input and the output scatterlist point to the same underlying memory (e.g., when tested with INPLACE_TWO_SGLISTS mode). The problem is that the hisi_sec2 driver maps these two different scatterlists using the DMA_BIDIRECTIONAL flag which leads to overlapped write mappings which are not supported by the DMA layer. Fix it by using the fine grained and correct DMA mapping directions. While at it, switch the DMA directions used by the hisi_zip driver too. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Reviewed-by: Longfang Liu <liulongfang@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
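The fix boils down to matching the DMA direction to how each scatterlist is actually used. A hedged sketch of the idea (not the hisi_sec2 code itself, which maps through hisi_acc_sg_buf_map_to_hw_sgl()):

  /* Hedged sketch: map the input for device reads and the output for
   * device writes, instead of mapping both DMA_BIDIRECTIONAL. */
  static int example_map_buffers(struct device *dev,
                                 struct scatterlist *src, int src_nents,
                                 struct scatterlist *dst, int dst_nents)
  {
          if (!dma_map_sg(dev, src, src_nents, DMA_TO_DEVICE))
                  return -ENOMEM;

          if (!dma_map_sg(dev, dst, dst_nents, DMA_FROM_DEVICE)) {
                  dma_unmap_sg(dev, src, src_nents, DMA_TO_DEVICE);
                  return -ENOMEM;
          }
          return 0;
  }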
2025-06-23  Merge 6.16-rc3 into driver-core-next  (Greg Kroah-Hartman)
We need the driver-core fixes that are in 6.16-rc3 into here as well to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-22  Merge tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Borislav Petkov:

- Make sure the array tracking which kernel text positions need to be alternatives-patched doesn't get mishandled by out-of-order modifications, leading to it overflowing and causing page faults when patching
- Avoid an infinite loop when early code does a ranged TLB invalidation before the broadcast TLB invalidation count of how many pages it can flush has been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an overflow of the bitmap tracking dynamic ASIDs which need to be flushed when the kernel switches between the user and kernel address space
- Handle the case of a CPU going offline and thus reporting zeroes when reading top-level events in the resctrl code

* tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives: Fix int3 handling failure from broken text_poke array
  x86/mm: Fix early boot use of INVPLGB
  x86/its: Fix an ifdef typo in its_alloc()
  x86/mm: Disable INVLPGB when PTI is enabled
  x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
2025-06-22  Merge tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Borislav Petkov:

- Avoid a crash on a heterogeneous machine where not all cores support the same hw events features
- Avoid a deadlock when throttling events
- Document the perf event states more
- Make sure a number of perf paths switching off or rescheduling events call perf_cgroup_event_disable()
- Make sure perf does task sampling before its userspace mapping is torn down, and not after

* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix crash in icl_update_topdown_event()
  perf: Fix the throttle error of some clock events
  perf: Add comment to enum perf_event_state
  perf/core: Fix WARN in perf_cgroup_switch()
  perf: Fix dangling cgroup pointer in cpuctx
  perf: Fix cgroup state vs ERROR
  perf: Fix sample vs do_exit()
2025-06-22  power: supply: core: rename power_supply_get_by_phandle to power_supply_get_by_reference  (Sebastian Reichel)
(devm_)power_supply_get_by_phandle now internally use fwnode and are no longer DT specific. Thus drop the ifdef check for CONFIG_OF and rename to (devm_)power_supply_get_by_reference to avoid the DT terminology. Reviewed-by: Hans de Goede <hansg@kernel.org> Link: https://lore.kernel.org/r/20250430-psy-core-convert-to-fwnode-v2-5-f9643b958677@collabora.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2025-06-22  power: supply: core: convert to fwnnode  (Sebastian Reichel)
Replace any DT specific code with fwnode in the power-supply core. Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Hans de Goede <hansg@kernel.org> Link: https://lore.kernel.org/r/20250430-psy-core-convert-to-fwnode-v2-4-f9643b958677@collabora.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2025-06-22  power: supply: core: remove of_node from power_supply_config  (Sebastian Reichel)
All drivers have been migrated from .of_node to .fwnode, so let's kill the former. Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Link: https://lore.kernel.org/r/20250430-psy-core-convert-to-fwnode-v2-2-f9643b958677@collabora.com Reviewed-by: Hans de Goede <hansg@kernel.org> Signed-off-by: Hans de Goede <hansg@kernel.org> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2025-06-21  net: pse-pd: Fix ethnl_pse_send_ntf() stub parameter type  (Kory Maincent)
The ethnl_pse_send_ntf() stub function has an incorrect parameter type when CONFIG_ETHTOOL_NETLINK is disabled. The function should take a net_device pointer instead of a phy_device pointer to match the actual implementation. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506200355.TqFiYUbN-lkp@intel.com/ Fixes: fc0e6db30941 ("net: pse-pd: Add support for reporting events") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250620091641.2098028-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-21  netmem: fix skb_frag_address_safe with unreadable skbs  (Mina Almasry)
skb_frag_address_safe() needs a check that the skb_frag_page exists, similar to skb_frag_address(). Cc: ap420073@gmail.com Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250619175239.3039329-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
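A hedged sketch of what the fixed helper could look like, bailing out when the frag has no backing page (e.g. unreadable devmem netmem) instead of dereferencing it; the in-tree version may differ in detail:

  static inline void *skb_frag_address_safe(const skb_frag_t *frag)
  {
          struct page *page = skb_frag_page(frag);

          if (!page)
                  return NULL;

          return page_address(page) + skb_frag_off(frag);
  }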
2025-06-20  Merge tag 'mtd/fixes-for-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux  (Linus Torvalds)
Pull mtd fixes from Miquel Raynal: "The main fix that really needs to get in is the revert of the patch adding the new mtd_master class, because it entirely fails the partitioning if a specific Kconfig option is set. We need to think about how to handle that differently, so let's revert it as we need to get back to the pen and paper situation again. Otherwise, the definitions of some Winbond SPI NAND chips are receiving some fixes (geometry and maximum frequency, mostly). And finally, a small memory leak also gets fixed."

* tag 'mtd/fixes-for-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
  mtd: spinand: fix memory leak of ECC engine conf
  mtd: spinand: winbond: Prevent unsupported frequencies on dual/quad I/O variants
  mtd: spinand: winbond: Increase maximum frequency on an octal operation
  mtd: spinand: winbond: Fix W35N number of planes/LUN
  Revert "mtd: core: always create master device"
2025-06-20  sched_ext: Add support for cgroup bandwidth control interface  (Tejun Heo)
- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.
- Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH.
- Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise.
- Update tg_set_bandwidth() to update the parameters for both fair and SCX.
- Add bandwidth control parameters to struct scx_cgroup_init_args.
- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates.
- Update scx_qmap and maximal selftest to test the new feature.
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20  sched_ext, sched/core: Factor out struct scx_task_group  (Tejun Heo)
More sched_ext fields will be added to struct task_group. In preparation, factor out sched_ext fields into struct scx_task_group to reduce clutter in the common header. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20  sched_ext: Merge branch 'for-6.16-fixes' into for-6.17  (Tejun Heo)
Pull sched_ext/for-6.16-fixes to receive:

  c50784e99f0e ("sched_ext: Make scx_group_set_weight() always update tg->scx.weight")
  33796b91871a ("sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()")

which are needed to implement CPU bandwidth control interface support. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20  iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields  (Sean Christopherson)
Delete the amd_ir_data.prev_ga_tag field now that all usage is superfluous. Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE  (Sean Christopherson)
Delete the previous per-vCPU IRTE link prior to modifying the IRTE. If forcing the IRTE back to remapped mode fails, the IRQ is already broken; keeping stale metadata won't change that, and the IOMMU should be sufficiently paranoid to sanitize the IRTE when the IRQ is freed and reallocated. This will allow hoisting the vCPU tracking to common x86, which in turn will allow most of the IRTE update code to be deduplicated. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure  (Sean Christopherson)
Track the IRTEs that are posting to an SVM vCPU via the associated irqfd structure and GSI routing instead of dynamically allocating a separate data structure. In addition to eliminating an atomic allocation, this will allow hoisting much of the IRTE update logic to common x86. Cc: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Pass new routing entries and irqfd when updating IRTEs  (Sean Christopherson)
When updating IRTEs in response to a GSI routing or IRQ bypass change, pass the new/current routing information along with the associated irqfd. This will allow KVM x86 to harden, simplify, and deduplicate its code. Since adding/removing a bypass producer is now conveniently protected with irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(), use the routing information cached in the irqfd instead of looking up the information in the current GSI routing tables. Opportunistically convert an existing printk() to pr_info() and put its string onto a single line (old code that strictly adhered to 80 chars). Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APIC  (Sean Christopherson)
Add a Kconfig to allow building KVM without support for emulating an I/O APIC, PIC, and PIT, which is desirable for deployments that effectively don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to create an in-kernel I/O APIC. E.g. compiling out support eliminates a few thousand lines of guest-facing code and gives security folks warm fuzzies. As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes it much easier for readers to understand which bits and pieces exist specifically for fully in-kernel IRQ chips. Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub for kvm_arch_post_irq_routing_update(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Hardcode the PIT IRQ source ID to '2'  (Sean Christopherson)
Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2 is always the first available bit in irq_sources_bitmap. Bits 0 and 1 are set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's impossible for the PIT to find a different free bit (there are no other users of kvm_request_irq_source_id()). Delete the now-defunct irq_sources_bitmap and all its associated code. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Move kvm_{request,free}_irq_source_id() to i8254.c (PIT)  (Sean Christopherson)
Move kvm_{request,free}_irq_source_id() to i8254.c, i.e. the dedicated PIT emulation file, in anticipation of removing them entirely in favor of hardcoding the PIT's "requested" source ID (the source ID can only ever be '2', and the request can never fail). No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()  (Sean Christopherson)
Trigger the I/O APIC route rescan that's performed for a split IRQ chip after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e. before dropping kvm->irq_lock. Calling kvm_make_all_cpus_request() under a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair in __kvm_make_request()+kvm_check_request() ensures the new routing is visible to vCPUs prior to the request being visible to vCPUs. In all likelihood, commit b053b2aef25d ("KVM: x86: Add EOI exit bitmap inference") somewhat arbitrarily made the request outside of irq_lock to avoid holding irq_lock any longer than is strictly necessary. And then commit abdb080f7ac8 ("kvm/irqchip: kvm_arch_irq_routing_update renaming split") took the easy route of adding another arch hook instead of risking a functional change. Note, the call to synchronize_srcu_expedited() does NOT provide ordering guarantees with respect to vCPUs scanning the new routing; as above, the request infrastructure provides the necessary ordering. I.e. there's no need to wait for kvm_scan_ioapic_routes() to complete if it's actively running, because regardless of whether it grabs the old or new table, the vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again and see the new mappings. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Require producers to pass in Linux IRQ number during registration  (Sean Christopherson)
Pass in the Linux IRQ associated with an IRQ bypass producer instead of relying on the caller to set the field prior to registration, as there's no benefit to relying on callers to do the right thing. Take care to set producer->irq before __connect(), as KVM expects the IRQ to be valid as soon as a connection is possible. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Use xarray to track producers and consumers  (Sean Christopherson)
Track IRQ bypass producers and consumers using an xarray to avoid the O(2n) insertion time associated with walking a list to check for duplicate entries, and to search for a partner. At low (tens or few hundreds) total producer/consumer counts, using a list is faster due to the need to allocate backing storage for the xarray. But as the count creeps into the thousands, xarray wins easily, and can provide several orders of magnitude better latency at high counts. E.g. hundreds of nanoseconds vs. hundreds of milliseconds. Cc: Oliver Upton <oliver.upton@linux.dev> Cc: David Matlack <dmatlack@google.com> Cc: Like Xu <like.xu.linux@gmail.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379 Link: https://lore.kernel.org/all/20230801115646.33990-1-likexu@tencent.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
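A hedged sketch of the data-structure change, assuming the eventfd token (which irqbypass takes ownership of in a related patch below) serves as the xarray index; the real code's key choice and locking are omitted:

  /* Hedged sketch: O(1)-ish duplicate detection and partner lookup. */
  static DEFINE_XARRAY(producers);

  static int example_register_producer(struct irq_bypass_producer *prod,
                                       struct eventfd_ctx *eventfd)
  {
          /* xa_insert() returns -EBUSY for duplicates, replacing a list walk. */
          return xa_insert(&producers, (unsigned long)eventfd, prod, GFP_KERNEL);
  }

  static struct irq_bypass_producer *example_find_producer(struct eventfd_ctx *eventfd)
  {
          return xa_load(&producers, (unsigned long)eventfd);
  }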
2025-06-20  irqbypass: Explicitly track producer and consumer bindings  (Sean Christopherson)
Explicitly track IRQ bypass producer:consumer bindings. This will allow making removal an O(1) operation; searching through the list to find information that is trivially tracked (and useful for debug) is wasteful. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Take ownership of producer/consumer token tracking  (Sean Christopherson)
Move ownership of IRQ bypass token tracking into irqbypass.ko, and explicitly require callers to pass an eventfd_ctx structure instead of a completely opaque token. Relying on producers and consumers to set the token appropriately is error prone, and hiding the fact that the token must be an eventfd_ctx pointer (for all intents and purposes) unnecessarily obfuscates the code and makes it more brittle. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: arm64: WARN if unmapping a vLPI fails in any path  (Sean Christopherson)
When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if failure occurs when freeing an ITE. If undoing vCPU affinity fails, then odds are very good that vLPI state tracking has gotten out of whack, i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI. At best, inconsistent state means there is a lurking bug/flaw somewhere. At worst, the inconsistency could eventually be fatal to the host, e.g. if an ITS command fails because KVM's view of things doesn't match reality/hardware. Note, only the call from kvm_arch_irq_bypass_del_producer() by way of kvm_vgic_v4_unset_forwarding() doesn't already WARN. Common KVM's kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails. For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(), the only possible causes are that the GIC doesn't have a v4 ITS (from its_irq_set_vcpu_affinity()):

  /* Need a v4 ITS */
  if (!is_v4(its_dev->its))
          return -EINVAL;

  guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);

  /* Unmap request? */
  if (!info)
          return its_vlpi_unmap(d);

or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):

  if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
          return -EINVAL;

All of the above failure scenarios are warnable offences, as they should never occur absent a kernel/KVM bug. Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/all/aFWY2LTVIxz5rfhh@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Bound the number of dirty ring entries in a single reset at INT_MAX  (Sean Christopherson)
Cap the number of ring entries that are reset in a single ioctl to INT_MAX to ensure userspace isn't confused by a wrap into negative space, and so that, in a truly pathological scenario, KVM doesn't miss a TLB flush due to the count wrapping to zero. While the size of the ring is fixed at 0x10000 entries and KVM (currently) supports at most 4096, userspace is allowed to harvest entries from the ring while the reset is in-progress, i.e. it's possible for the ring to always have harvested entries. Opportunistically return an actual error code from the helper so that a future fix to handle pending signals can gracefully return -EINTR. Drop the function comment now that the return code is a standard 0/-errno (and because a future commit will add a proper lockdep assertion). Opportunistically drop a similarly stale comment for kvm_dirty_ring_push(). Cc: Peter Xu <peterx@redhat.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  HID: rate-limit hid_warn to prevent log flooding  (Li Chen)
Syzkaller can create many uhid devices that trigger repeated warnings like: "hid-generic xxxx: unknown main item tag 0x0" These messages can flood the system log, especially if a crash occurs (e.g., with a slow UART console, leading to soft lockups). To mitigate this, convert `hid_warn()` to use `dev_warn_ratelimited()`. This helps reduce log noise and improves system stability under fuzzing or faulty device scenarios. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Jiri Kosina <jkosina@suse.com>
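The conversion amounts to swapping the printing backend. A hedged sketch of the macro (the exact device-prefix formatting in the in-tree definition may differ):

  /* Hedged sketch: rate-limit warnings emitted during HID report parsing. */
  #define hid_warn(hid, fmt, ...) \
          dev_warn_ratelimited(&(hid)->dev, fmt, ##__VA_ARGS__)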
2025-06-20  wifi: ieee80211: add Radio Measurement action fields  (Aditya Kumar Singh)
Drivers that support Tx power insertion could examine an outgoing Radio Measurement packet and, depending on the packet type, insert specific data fields into it. These action field values will help drivers classify the action code within the Radio Measurement action packet. These action fields are defined in IEEE 802.11-2024, Table 9-470 (Radio Measurement Action field values). Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Link: https://patch.msgid.link/20250528-add_rrm_action_code-v1-1-6b7c78b5bbaf@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20  Merge tag 'drm-misc-next-2025-06-19' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next  (Dave Airlie)
drm-misc-next for 6.17:

UAPI Changes:
- Add Task Information for the wedge API

Cross-subsystem Changes:

Core Changes:
- Fix warnings related to export.h
- fbdev: Make CONFIG_FIRMWARE_EDID available on all architectures
- fence: Fix UAF issues
- format-helper: Improve tests

Driver Changes:
- ivpu: Add turbo flag, Add Wildcat Lake Support
- rz-du: Improve MIPI-DSI Support
- vmwgfx: fence improvement

Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://lore.kernel.org/r/20250619-perfect-industrious-whippet-8ed3db@houat
2025-06-19  clk: add a clk_hw helpers to get the clock device or device_node  (Jerome Brunet)
Add helpers to get the device or device_node associated with clk_hw. This can be used by clock drivers to access various device related functionality such as devres, dev_ prints, etc ... Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20250417-clk-hw-get-helpers-v1-1-7743e509612a@baylibre.com Reviewed-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
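A hedged usage sketch; the accessor names follow the subject line (clk_hw_get_dev() and clk_hw_get_of_node() are assumed) and the surrounding driver code is purely illustrative:

  /* Hedged sketch: a clock driver using the new accessors for dev_ prints. */
  static int example_use_clk_hw_helpers(struct clk_hw *hw)
  {
          struct device *dev = clk_hw_get_dev(hw);          /* assumed helper */
          struct device_node *np = clk_hw_get_of_node(hw);  /* assumed helper */

          if (dev)
                  dev_dbg(dev, "clock %s registered\n", clk_hw_get_name(hw));

          return np ? 0 : -ENODEV;
  }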
2025-06-19  Merge branch 'ref_tracker-add-ability-to-register-a-debugfs-file-for-a-ref_tracker_dir'  (Jakub Kicinski)
Jeff Layton says:
====================
ref_tracker: add ability to register a debugfs file for a ref_tracker_dir

For those just joining in, this series adds a new top-level "ref_tracker" debugfs directory, and has each ref_tracker_dir register a file in there as part of its initialization. It also adds the ability to register a symlink with a more human-usable name that points to the file, and does some general cleanup of how the ref_tracker object names are handled.

v14: https://lore.kernel.org/20250610-reftrack-dbgfs-v14-0-efb532861428@kernel.org
v13: https://lore.kernel.org/20250603-reftrack-dbgfs-v13-0-7b2a425019d8@kernel.org
v12: https://lore.kernel.org/20250529-reftrack-dbgfs-v12-0-11b93c0c0b6e@kernel.org
v11: https://lore.kernel.org/20250528-reftrack-dbgfs-v11-0-94ae0b165841@kernel.org
v10: https://lore.kernel.org/20250527-reftrack-dbgfs-v10-0-dc55f7705691@kernel.org
v9: https://lore.kernel.org/20250509-reftrack-dbgfs-v9-0-8ab888a4524d@kernel.org
v8: https://lore.kernel.org/20250507-reftrack-dbgfs-v8-0-607717d3bb98@kernel.org
v7: https://lore.kernel.org/20250505-reftrack-dbgfs-v7-0-f78c5d97bcca@kernel.org
v6: https://lore.kernel.org/20250430-reftrack-dbgfs-v6-0-867c29aff03a@kernel.org
v5: https://lore.kernel.org/20250428-reftrack-dbgfs-v5-0-1cbbdf2038bd@kernel.org
v4: https://lore.kernel.org/20250418-reftrack-dbgfs-v4-0-5ca5c7899544@kernel.org
v3: https://lore.kernel.org/20250417-reftrack-dbgfs-v3-0-c3159428c8fb@kernel.org
v2: https://lore.kernel.org/20250415-reftrack-dbgfs-v2-0-b18c4abd122f@kernel.org
v1: https://lore.kernel.org/20250414-reftrack-dbgfs-v1-0-f03585832203@kernel.org
====================

Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-0-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  ref_tracker: eliminate the ref_tracker_dir name field  (Jeff Layton)
Now that we have dentries and the ability to create meaningful symlinks to them, don't keep a name string in each tracker. Switch the output format to print "class@address", and drop the name field. Also, add a kerneldoc header for ref_tracker_dir_init(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-9-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  ref_tracker: add a way to create a symlink to the ref_tracker_dir debugfs file  (Jeff Layton)
Add the ability for a subsystem to add a user-friendly symlink that points to a ref_tracker_dir's debugfs file. Add a separate debugfs_symlinks xarray and use that to track symlinks. The reaper workqueue job will remove symlinks before their corresponding dentries. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-7-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  ref_tracker: automatically register a file in debugfs for a ref_tracker_dir  (Jeff Layton)
Currently, there is no convenient way to see the info that the ref_tracking infrastructure collects. Attempt to create a file in debugfs when called from ref_tracker_dir_init(). The file is given the name "class@%px", as having the unmodified address is helpful for debugging. This should be safe since this directory is only accessible by root. While ref_tracker_dir_init() is generally called from a context where sleeping is OK, ref_tracker_dir_exit() can be called from anywhere. Thus, dentry cleanup must be handled asynchronously. Add a new global xarray that has entries with the ref_tracker_dir pointer as the index and the corresponding debugfs dentry pointer as the value. Instead of removing the debugfs dentry, have ref_tracker_dir_exit() set a mark on the xarray entry and schedule a workqueue job. The workqueue job then walks the xarray looking for marked entries, and removes their xarray entries and the debugfs dentries. Because of this, the debugfs dentry can outlive the corresponding ref_tracker_dir. Have ref_tracker_debugfs_show() take extra care to ensure that it's safe to dereference the dir pointer before using it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-6-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
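A hedged sketch of the deferred-cleanup scheme described above, using an xarray mark plus a workqueue job; names are illustrative rather than the merged API, and insertion at init time is not shown:

  /* Hedged sketch: the ref_tracker_dir pointer is the index, the debugfs
   * dentry is the value; exit marks the entry, the reaper removes it later. */
  static DEFINE_XARRAY(example_dentries);
  #define EXAMPLE_DIR_DEAD        XA_MARK_0

  static void example_reap_fn(struct work_struct *work)
  {
          struct dentry *dentry;
          unsigned long index;

          xa_for_each_marked(&example_dentries, index, dentry, EXAMPLE_DIR_DEAD) {
                  debugfs_remove(dentry);
                  xa_erase(&example_dentries, index);
          }
  }
  static DECLARE_WORK(example_reap_work, example_reap_fn);

  static void example_dir_exit(struct ref_tracker_dir *dir)
  {
          /* Safe from any context: no dentry teardown happens here. */
          xa_set_mark(&example_dentries, (unsigned long)dir, EXAMPLE_DIR_DEAD);
          schedule_work(&example_reap_work);
  }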
2025-06-19  ref_tracker: add a static classname string to each ref_tracker_dir  (Jeff Layton)
A later patch in the series will be adding debugfs files for each ref_tracker that get created in ref_tracker_dir_init(). The format will be "class@%px". The current "name" string can vary between ref_tracker_dir objects of the same type, so it's not suitable for this purpose. Add a new "class" string to the ref_tracker dir that describes the type of object (sans any individual info for that object). Also, in the i915 driver, gate the creation of debugfs files on whether the dentry pointer is still set to NULL. CI has shown that the ref_tracker_dir can be initialized more than once. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-4-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  ref_tracker: have callers pass output function to pr_ostream()  (Jeff Layton)
In a later patch, we'll be adding a 3rd mechanism for outputting ref_tracker info via seq_file. Instead of a conditional, have the caller set a pointer to an output function in struct ostream. As part of this, the log prefix must be explicitly passed in, as it's too late for the pr_fmt macro. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-3-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  net: add dev_dstats_rx_dropped_add() helper  (Breno Leitao)
Introduce the dev_dstats_rx_dropped_add() helper to allow incrementing the rx_drops per-CPU statistic by an arbitrary value, rather than just one. This is useful for drivers or code paths that need to account for multiple dropped packets at once, such as when dropping entire queues. Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250618-netdevsim_stat-v4-3-19fe0d35e28e@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
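A hedged sketch of the helper, mirroring the existing dev_dstats_rx_dropped() pattern in netdevice.h; the field names assume the standard pcpu_dstats layout:

  /* Hedged sketch: bump the per-CPU rx_drops counter by @packets at once. */
  static inline void dev_dstats_rx_dropped_add(struct net_device *dev,
                                               unsigned int packets)
  {
          struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);

          u64_stats_update_begin(&dstats->syncp);
          u64_stats_add(&dstats->rx_drops, packets);
          u64_stats_update_end(&dstats->syncp);
  }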