linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2024-03-11	Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock.
2024-03-11	Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 PMU changes for 6.9: - Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding. - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc. - Add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests. - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction. - Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11	Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter.
2024-03-11	Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 MMU changes for 6.9: - Clean up code related to unprotecting shadow pages when retrying a guest instruction after failed #PF-induced emulation. - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if a reschedule is needed, e.g. if a high priority task needs to run. Because KVM doesn't support yielding in the middle of processing a zapped non-leaf SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when attempting to schedule in a high priority. - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot. - Allocate write-tracking metadata on-demand to avoid the memory overhead when running kernels built with KVMGT support (external write-tracking enabled), but for workloads that don't use nested virtualization (shadow paging) or KVMGT.
2024-03-11	Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes.
2024-03-11	Merge tag 'kvm-x86-generic-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM common MMU changes for 6.9: - Harden KVM against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel. - Fix a benign bug in __kvm_mmu_topup_memory_cache() where the object size and number of objects parameters to kvmalloc_array() were swapped.
2024-03-11	Merge tag 'kvm-x86-asyncpf-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM async page fault changes for 6.9: - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a use-after-free if kvm.ko is unloaded. - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, e.g. so that there's no need to remember to conditionally clean up after the worker.
2024-03-11	Merge tag 'kvm-x86-selftests-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM selftests changes for 6.9: - Add macros to reduce the amount of boilerplate code needed to write "simple" selftests, and to utilize selftest TAP infrastructure, which is especially beneficial for KVM selftests with multiple testcases. - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory. - Fix benign bugs where tests neglect to close() guest_memfd files.
2024-03-11	Merge tag 'kvm-riscv-6.9-1' of https://github.com/kvm-riscv/linux into HEAD	Paolo Bonzini
	KVM/riscv changes for 6.9 - Exception and interrupt handling for selftests - Sstc (aka arch_timer) selftest - Forward seed CSR access to KVM userspace - Ztso extension support for Guest/VM - Zacas extension support for Guest/VM
2024-03-11	Merge tag 'kvmarm-6.9' of ↵	Paolo Bonzini
	https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.9 - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests
2024-03-11	Merge tag 'loongarch-kvm-6.9' of ↵	Paolo Bonzini
	git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.9 * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest.
2024-03-09	Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of ↵	Paolo Bonzini
	https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-07	Merge branch kvm-arm64/kerneldoc into kvmarm/next	Oliver Upton
	* kvm-arm64/kerneldoc: : kerneldoc warning fixes, courtesy of Randy Dunlap : : Fixes addressing the widespread misuse of kerneldoc-style comments : throughout KVM/arm64. KVM: arm64: vgic: fix a kernel-doc warning KVM: arm64: vgic-its: fix kernel-doc warnings KVM: arm64: vgic-init: fix a kernel-doc warning KVM: arm64: sys_regs: fix kernel-doc warnings KVM: arm64: PMU: fix kernel-doc warnings KVM: arm64: mmu: fix a kernel-doc warning KVM: arm64: vhe: fix a kernel-doc warning KVM: arm64: hyp/aarch32: fix kernel-doc warnings KVM: arm64: guest: fix kernel-doc warnings KVM: arm64: debug: fix kernel-doc warnings Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07	Merge branch kvm-arm64/vfio-normal-nc into kvmarm/next	Oliver Upton
	* kvm-arm64/vfio-normal-nc: : Normal-NC support for vfio-pci @ stage-2, courtesy of Ankit Agrawal : : KVM's policy to date has been that any and all MMIO mapping at stage-2 : is treated as Device-nGnRE. This is primarily done due to concerns of : the guest triggering uncontainable failures in the system if they manage : to tickle the device / memory system the wrong way, though this is : unnecessarily restrictive for devices that can be reasoned as 'safe'. : : Unsurprisingly, the Device-* mapping can really hurt the performance of : assigned devices that can handle Gathering, and can be an outright : correctness issue if the guest driver does unaligned accesses. : : Rather than opening the floodgates to the full ecosystem of devices that : can be exposed to VMs, take the conservative approach and allow PCI : devices to be mapped as Normal-NC since it has been determined to be : 'safe'. vfio: Convey kvm that the vfio-pci device is wc safe KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device mm: Introduce new flag to indicate wc safe KVM: arm64: Introduce new flag for non-cacheable IO memory Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07	Merge branch kvm-arm64/lpi-xarray into kvmarm/next	Oliver Upton
	* kvm-arm64/lpi-xarray: : xarray-based representation of vgic LPIs : : KVM's linked-list of LPI state has proven to be a bottleneck in LPI : injection paths, due to lock serialization when acquiring / releasing a : reference on an IRQ. : : Start the tedious process of reworking KVM's LPI injection by replacing : the LPI linked-list with an xarray, leveraging this to allow RCU readers : to walk it outside of the spinlock. KVM: arm64: vgic: Don't acquire the lpi_list_lock in vgic_put_irq() KVM: arm64: vgic: Ensure the irq refcount is nonzero when taking a ref KVM: arm64: vgic: Rely on RCU protection in vgic_get_lpi() KVM: arm64: vgic: Free LPI vgic_irq structs in an RCU-safe manner KVM: arm64: vgic: Use atomics to count LPIs KVM: arm64: vgic: Get rid of the LPI linked-list KVM: arm64: vgic-its: Walk the LPI xarray in vgic_copy_lpi_list() KVM: arm64: vgic-v3: Iterate the xarray to find pending LPIs KVM: arm64: vgic: Use xarray to find LPI in vgic_get_lpi() KVM: arm64: vgic: Store LPIs in an xarray Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07	Merge branch kvm-arm64/vm-configuration into kvmarm/next	Oliver Upton
	* kvm-arm64/vm-configuration: (29 commits) : VM configuration enforcement, courtesy of Marc Zyngier : : Userspace has gained the ability to control the features visible : through the ID registers, yet KVM didn't take this into account as the : effective feature set when determing trap / emulation behavior. This : series adds: : : - Mechanism for testing the presence of a particular CPU feature in the : guest's ID registers : : - Infrastructure for computing the effective value of VNCR-backed : registers, taking into account the RES0 / RES1 bits for a particular : VM configuration : : - Implementation of 'fine-grained UNDEF' controls that shadow the FGT : register definitions. KVM: arm64: Don't initialize idreg debugfs w/ preemption disabled KVM: arm64: Fail the idreg iterator if idregs aren't initialized KVM: arm64: Make build-time check of RES0/RES1 bits optional KVM: arm64: Add debugfs file for guest's ID registers KVM: arm64: Snapshot all non-zero RES0/RES1 sysreg fields for later checking KVM: arm64: Make FEAT_MOPS UNDEF if not advertised to the guest KVM: arm64: Make AMU sysreg UNDEF if FEAT_AMU is not advertised to the guest KVM: arm64: Make PIR{,E0}_EL1 UNDEF if S1PIE is not advertised to the guest KVM: arm64: Make TLBI OS/Range UNDEF if not advertised to the guest KVM: arm64: Streamline save/restore of HFG[RW]TR_EL2 KVM: arm64: Move existing feature disabling over to FGU infrastructure KVM: arm64: Propagate and handle Fine-Grained UNDEF bits KVM: arm64: Add Fine-Grained UNDEF tracking information KVM: arm64: Rename __check_nv_sr_forward() to triage_sysreg_trap() KVM: arm64: Use the xarray as the primary sysreg/sysinsn walker KVM: arm64: Register AArch64 system register entries with the sysreg xarray KVM: arm64: Always populate the trap configuration xarray KVM: arm64: nv: Move system instructions to their own sys_reg_desc array KVM: arm64: Drop the requirement for XARRAY_MULTI KVM: arm64: nv: Turn encoding ranges into discrete XArray stores ... Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07	Merge branch kvm-arm64/misc into kvmarm/next	Oliver Upton
	* kvm-arm64/misc: : Miscellaneous updates : : - Fix handling of features w/ nonzero safe values in set_id_regs : selftest : : - Cleanup the unused kern_hyp_va() asm macro : : - Differentiate nVHE and hVHE in boot-time message : : - Several selftests cleanups : : - Drop bogus return value from kvm_arch_create_vm_debugfs() : : - Make save/restore of SPE and TRBE control registers affect EL1 state : in hVHE mode : : - Typos KVM: arm64: Fix TRFCR_EL1/PMSCR_EL1 access in hVHE mode KVM: selftests: aarch64: Remove unused functions from vpmu test KVM: arm64: Fix typos KVM: Get rid of return value from kvm_arch_create_vm_debugfs() KVM: selftests: Print timer ctl register in ISTATUS assertion KVM: selftests: Fix GUEST_PRINTF() format warnings in ARM code KVM: arm64: removed unused kern_hyp_va asm macro KVM: arm64: add comments to __kern_hyp_va KVM: arm64: print Hyp mode KVM: arm64: selftests: Handle feature fields with nonzero minimum value correctly Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07	Merge branch kvm-arm64/feat_e2h0 into kvmarm/next	Oliver Upton
	* kvm-arm64/feat_e2h0: : Support for FEAT_E2H0, courtesy of Marc Zyngier : : As described in the cover letter: : : Since ARMv8.1, the architecture has grown the VHE feature, which makes : EL2 a superset of EL1. With ARMv9.5 (and retroactively allowed from : ARMv8.1), the architecture allows implementations to have VHE as the : only implemented behaviour, meaning that HCR_EL2.E2H can be : implemented as RES1. As a follow-up, HCR_EL2.NV1 can also be : implemented as RES0, making the VHE-ness of the architecture : recursive. : : This series adds support for detecting the architectural feature of E2H : being RES1, leveraging the existing infrastructure for handling : out-of-spec CPUs that are VHE-only. Additionally, the (incomplete) NV : infrastructure in KVM is updated to enforce E2H=1 for guest hypervisors : on implementations that do not support NV1. arm64: cpufeatures: Fix FEAT_NV check when checking for FEAT_NV1 arm64: cpufeatures: Only check for NV1 if NV is present arm64: cpufeatures: Add missing ID_AA64MMFR4_EL1 to __read_sysreg_by_encoding() KVM: arm64: Handle Apple M2 as not having HCR_EL2.NV1 implemented KVM: arm64: Force guest's HCR_EL2.E2H RES1 when NV1 is not implemented KVM: arm64: Expose ID_AA64MMFR4_EL1 to guests arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative arm64: cpufeature: Detect HCR_EL2.NV1 being RES0 arm64: cpufeature: Add ID_AA64MMFR4_EL1 handling arm64: sysreg: Add layout for ID_AA64MMFR4_EL1 arm64: cpufeature: Correctly display signed override values arm64: cpufeatures: Correctly handle signed values arm64: Add macro to compose a sysreg field value Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-06	KVM: riscv: selftests: Add Zacas extension to get-reg-list test	Anup Patel
	The KVM RISC-V allows Zacas extension for Guest/VM so add this extension to get-reg-list test. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	RISC-V: KVM: Allow Zacas extension for Guest/VM	Anup Patel
	Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Zacas extension for Guest/VM. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	KVM: riscv: selftests: Add Ztso extension to get-reg-list test	Anup Patel
	The KVM RISC-V allows Ztso extension for Guest/VM so add this extension to get-reg-list test. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	RISC-V: KVM: Allow Ztso extension for Guest/VM	Anup Patel
	Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Ztso extension for Guest/VM. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	RISC-V: KVM: Forward SEED CSR access to user space	Anup Patel
	The SEED CSR access from VS/VU mode (guest) will always trap to HS-mode (KVM) when Zkr extension is available to the Guest/VM. Forward this CSR access to KVM user space so that it can be emulated based on the method chosen by VMM. Fixes: f370b4e668f0 ("RISC-V: KVM: Allow scalar crypto extensions for Guest/VM") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	KVM: riscv: selftests: Add sstc timer test	Haibo Xu
	Add a KVM selftests to validate the Sstc timer functionality. The test was ported from arm64 arch timer test. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	KVM: riscv: selftests: Change vcpu_has_ext to a common function	Haibo Xu
	Move vcpu_has_ext to the processor.c and rename it to __vcpu_has_ext so that other test cases can use it for vCPU extension check. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	KVM: riscv: selftests: Add guest helper to get vcpu id	Haibo Xu
	Add guest_get_vcpuid() helper to simplify accessing to per-cpu private data. The sscratch CSR was used to store the vcpu id. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	KVM: riscv: selftests: Add exception handling support	Haibo Xu
	Add the infrastructure for guest exception handling in riscv selftests. Customized handlers can be enabled by vm_install_exception_handler(vector) or vm_install_interrupt_handler(). The code is inspired from that of x86/arm64. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-06	LoongArch: KVM: Remove unnecessary CSR register saving during enter guest	Bibo Mao
	Some CSR registers like CRMD/PRMD are saved during enter VM mode now. However they are not restored for actual use, so saving for these CSR registers can be removed. Reviewed-by: Tianrui Zhao <zhaotianrui@loongson.cn> Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-03-06	LoongArch: KVM: Do not restart SW timer when it is expired	Bibo Mao
	LoongArch VCPUs have their own separate HW timers. SW timer is to wake up blocked vcpu thread, rather than HW timer emulation. When blocking vcpu scheduled out, SW timer is used to wakeup blocked vcpu thread and injects timer interrupt. It does not care about whether guest timer is in period mode or oneshot mode, and SW timer needs not to be restarted since vcpu has been woken. This patch does not restart SW timer when it is expired. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-03-06	LoongArch: KVM: Start SW timer only when vcpu is blocking	Bibo Mao
	SW timer is enabled when vcpu thread is scheduled out, and it is to wake up vcpu from blocked queue. If vcpu thread is scheduled out but is not blocked, such as it is preempted by other threads, it is not necessary to enable SW timer. Since vcpu thread is still on running queue if it is preempted and SW timer is only to wake up vcpu on blocking queue, so SW timer is not useful in this situation. This patch enables SW timer only when vcpu is scheduled out and is blocking. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-03-06	LoongArch: KVM: Set reserved bits as zero in CPUCFG	Bibo Mao
	Supported CPUCFG information comes from function _kvm_get_cpucfg_mask(). A bit should be zero if it is reserved by HW or if it is not supported by KVM. Also LoongArch software page table walk feature defined in CPUCFG2_LSPW is supported by KVM, it should be enabled by default. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-03-05	KVM: selftests: Explicitly close guest_memfd files in some gmem tests	Dongli Zhang
	Explicitly close() guest_memfd files in various guest_memfd and private_mem_conversions tests, there's no reason to keep the files open until the test exits. Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()") Fixes: 43f623f350ce ("KVM: selftests: Add x86-only selftest for private memory conversions") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240227015716.27284-1-dongli.zhang@oracle.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04	KVM: x86/xen: fix recursive deadlock in timer injection	David Woodhouse
	The fast-path timer delivery introduced a recursive locking deadlock when userspace configures a timer which has already expired and is delivered immediately. The call to kvm_xen_inject_timer_irqs() can call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock, which is already held in kvm_xen_vcpu_get_attr(). ============================================ WARNING: possible recursive locking detected 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O -------------------------------------------- xen_shinfo_test/250013 is trying to acquire lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm] but task is already holding lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm] Now that the gfn_to_pfn_cache has its own self-sufficient locking, its callers no longer need to ensure serialization, so just stop taking kvm->arch.xen.xen_lock from kvm_xen_set_evtchn(). Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04	KVM: pfncache: simplify locking and make more self-contained	David Woodhouse
	The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its own locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all but the cases where the vcpu or kvm object is being destroyed, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So expand the use of the ->refresh_lock mutex to ensure serialization of activate/deactivate vs. refresh and make the pfncache locking entirely self-sufficient. This means that a future commit can simplify the locking in the callers, such as the Xen emulation code which has an outstanding problem with recursive locking of kvm->arch.xen.xen_lock, which will no longer be necessary. The rwlock abuse described above is still not best practice, although it's harmless now that the ->refresh_lock is held for the entire duration while the offending code drops the write lock, does some other stuff, then takes the write lock again and assumes nothing changed. That can also be fixed^W cleaned up in a subsequent commit, but this commit is a simpler basis for the Xen deadlock fix mentioned above. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org [sean: use guard(mutex) to fix a missed unlock] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04	KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery	David Woodhouse
	The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast version will always work for physical unicast", justifying its use of kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails. In fact that assumption isn't true if X2APIC isn't in use by the guest and there is (8-bit x)APIC ID aliasing. A single "unicast" destination APIC ID may then be delivered to multiple vCPUs. Remove the warning, and in fact it might as well just call kvm_irq_delivery_to_apic(). Reported-by: Michal Luczaj <mhal@rbox.co> Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04	KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled	David Woodhouse
	Linux guests since commit b1c3497e604d ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flags is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is disabled. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04	KVM: x86/xen: improve accuracy of Xen timers	David Woodhouse
	A test program such as http://david.woodhou.se/timerlat.c confirms user reports that timers are increasingly inaccurate as the lifetime of a guest increases. Reporting the actual delay observed when asking for 100µs of sleep, it starts off OK on a newly-launched guest but gets worse over time, giving incorrect sleep times: root@ip-10-0-193-21:~# ./timerlat -c -n 5 00000000 latency 103243/100000 (3.2430%) 00000001 latency 103243/100000 (3.2430%) 00000002 latency 103242/100000 (3.2420%) 00000003 latency 103245/100000 (3.2450%) 00000004 latency 103245/100000 (3.2450%) The biggest problem is that get_kvmclock_ns() returns inaccurate values when the guest TSC is scaled. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). The guest then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day. It's non-trivial to fix get_kvmclock_ns(), although I'll come back to that. The actual guest hv_clock is per-CPU, and theoretically each vCPU could be running at a different frequency. But this patch is needed anyway because... The other issue with Xen timers was that the code would snapshot the host CLOCK_MONOTONIC at some point in time, and then... after a few interrupts may have occurred, some preemption perhaps... would also read the guest's kvmclock. Then it would proceed under the false assumption that those two happened at the same time. Any time which actually elapsed between reading the two clocks was introduced as inaccuracies in the time at which the timer fired. Fix it to use a variant of kvm_get_time_and_clockread(), which reads the host TSC just once, then use the returned TSC value to calculate the kvmclock (making sure to do that the way the guest would instead of making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use CLOCK_MONOTONIC. In practice the difference between the two won't matter over the timescales involved, as the absolute values don't matter; just the delta. This does mean a new variant of kvm_get_time_and_clockread() is needed; called kvm_get_monotonic_and_clockread() because that's what it does. Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org [sean: massage moved comment, tweak if statement formatting] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-03	Linux 6.8-rc7v6.8-rc7	Linus Torvalds

2024-03-03	Merge tag 'phy-fixes2-6.8' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy Pull phy fixes from Vinod Koul: - qcom: m31 pointer err fix, eusb2 fix redundant zero-out loop and v3 offset fix on qmp-usb - freescale: fix for dphy alias * tag 'phy-fixes2-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: phy: qcom-qmp-usb: fix v3 offsets data phy: qualcomm: eusb2-repeater: Rework init to drop redundant zero-out loop phy: qcom: phy-qcom-m31: fix wrong pointer pass to PTR_ERR() phy: freescale: phy-fsl-imx8-mipi-dphy: Fix alias name to use dashes
2024-03-03	Merge tag 'dmaengine-fix2-6.8' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine fixes from Vinod Koul: - dw-edma fixes to improve driver and remote HDMA setup - fsl-edma fixes for SoC hange, irq init and byte calculations and sparse fixes - idxd: safe user copy of completion record fix - ptdma: consistent DMA mask fix * tag 'dmaengine-fix2-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: dmaengine: ptdma: use consistent DMA masks dmaengine: fsl-qdma: add __iomem and struct in union to fix sparse warning dmaengine: idxd: Ensure safe user copy of completion record dmaengine: fsl-edma: correct max_segment_size setting dmaengine: idxd: Remove shadow Event Log head stored in idxd dmaengine: fsl-edma: correct calculation of 'nbytes' in multi-fifo scenario dmaengine: fsl-qdma: init irq after reg initialization dmaengine: fsl-qdma: fix SoC may hang on 16 byte unaligned read dmaengine: dw-edma: eDMA: Add sync read before starting the DMA transfer in remote setup dmaengine: dw-edma: HDMA: Add sync read before starting the DMA transfer in remote setup dmaengine: dw-edma: Add HDMA remote interrupt configuration dmaengine: dw-edma: HDMA_V0_REMOTEL_STOP_INT_EN typo fix dmaengine: dw-edma: Fix wrong interrupt bit set for HDMA dmaengine: dw-edma: Fix the ch_count hdma callback
2024-03-03	Merge tag 'powerpc-6.8-5' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix IOMMU table initialisation when doing kdump over SR-IOV - Fix incorrect RTAS function name for resetting TCE tables - Fix fpu_signal selftest failures since a recent change Thanks to Gaurav Batra and Nathan Lynch. * tag 'powerpc-6.8-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: selftests/powerpc: Fix fpu_signal failures powerpc/rtas: use correct function name for resetting TCE tables powerpc/pseries/iommu: IOMMU table is not initialized for kdump over SR-IOV
2024-03-03	Merge tag 'x86_urgent_for_v6.8_rc7' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Do not reserve SETUP_RNG_SEED setup data in the e820 map as it should be used by kexec only - Make sure MKTME feature detection happens at an earlier time in the boot process so that the physical address size supported by the CPU is properly corrected and MTRR masks are programmed properly, leading to TDX systems booting without disable_mtrr_cleanup on the cmdline - Make sure the different address sizes supported by the CPU are read out as early as possible * tag 'x86_urgent_for_v6.8_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/e820: Don't reserve SETUP_RNG_SEED in e820 x86/cpu/intel: Detect TME keyid bits before setting MTRR mask registers x86/cpu: Allow reducing x86_phys_bits during early_identify_cpu()
2024-03-02	Merge tag 'firewire-fixes-6.8-rc7' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 Pull firewire fixes from Takashi Sakamoto: "A workaround to suppress the continuous bus resets in the case that older devices are connected to the modern 1394 OHCI hardware and devices In IEEE 1394 Amendment (IEEE 1394a-2000), the short bus reset is added to resolve the shortcomings of the long bus reset in IEEE 1394-1995. However, it is well-known that the solution is not necessarily effective in the mixing environment that both IEEE 1394-1995 PHY and IEEE 1394a-2000 (or later) PHY exist, as described in section 8.4.6.2 of IEEE 1394a-2000. The current implementation of firewire stack schedules the short bus reset when attempting to resolve the mismatch of gap count in the certain generation of bus topology. It can cause the continuous bus reset in the issued environment. The workaround simply uses the long bus reset instead of the short bus reset. It is desirable to detect whether the issued environment or not. However, the way to access PHY registers from remote note is firstly defined in IEEE 1394a-2000, thus it is not available in the case" * tag 'firewire-fixes-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: core: use long bus reset on gap count error
2024-03-02	Merge tag 'xfs-6.8-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	Linus Torvalds
	Pull xfs fix from Chandan Babu: "Drop experimental warning message when mounting an xfs filesystem on an fsdax device. We now consider xfs on fsdax to be stable" * tag 'xfs-6.8-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: drop experimental warning for FSDAX
2024-03-02	Merge tag 'gpio-fixes-for-v6.8-rc7' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix resource freeing ordering in error path when adding a GPIO chip - only set pins to output after the reset is complete in gpio-74x164 * tag 'gpio-fixes-for-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: fix resource unwinding order in error path gpiolib: Fix the error path order in gpiochip_add_data_with_key() gpio: 74x164: Enable output pins after registers are reset
2024-03-02	block: define bvec_iter as __packed __aligned(4)	Ming Lei
	In commit 19416123ab3e ("block: define 'struct bvec_iter' as packed"), what we need is to save the 4byte padding, and avoid `bio` to spread on one extra cache line. It is enough to define it as '__packed __aligned(4)', as '__packed' alone means byte aligned, and can cause compiler to generate horrible code on architectures that don't support unaligned access in case that bvec_iter is embedded in other structures. Cc: Mikulas Patocka <mpatocka@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 19416123ab3e ("block: define 'struct bvec_iter' as packed") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-02	Merge tag 'scsi-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Two small fixes, all in drivers (the more obsolete mpt3sas and the newer mpi3mr)" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: mpt3sas: Prevent sending diag_reset when the controller is ready scsi: mpi3mr: Reduce stack usage in mpi3mr_refresh_sas_ports()
2024-03-01	Merge tag 'for-v6.8-rc2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply Pull power supply fixes from Sebastian Reichel: - Kconfig dependency fix - bq27xxx-i2c: do not free non-existing IRQ * tag 'for-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: power: supply: bq27xxx-i2c: Do not free non existing IRQ power: supply: mm8013: select REGMAP_I2C
2024-03-01	Merge tag 'for-linus-iommufd' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd fixes from Jason Gunthorpe: "Four syzkaller found bugs: - Corruption during error unwind in iommufd_access_change_ioas() - Overlapping IDs in the test suite due to out of order destruction - Missing locking for access->ioas in the test suite - False failures in the test suite validation logic with huge pages" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: iommufd/selftest: Don't check map/unmap pairing with HUGE_PAGES iommufd: Fix protection fault in iommufd_test_syz_conv_iova iommufd/selftest: Fix mock_dev_num bug iommufd: Fix iopt_access_list_id overwrite bug
2024-03-01	Merge tag 'devicetree-fixes-for-6.8-2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fix from Rob Herring: "One fix for a bug in fw_devlink handling of OF graph. This doesn't completely fix the reported problems, but it's with users adding out of tree code" * tag 'devicetree-fixes-for-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: of: property: fw_devlink: Fix stupid bug in remote-endpoint parsing