path: root/arch/arm64/kvm
2024-10-31  KVM: arm64: Add kvm_has_s1poe() helper  (Marc Zyngier)
Just like we have kvm_has_s1pie(), add its S1POE counterpart, making the code slightly more readable. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-31-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
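For illustration, a minimal sketch of the helper pair, assuming kvm_has_s1poe() mirrors the existing kvm_has_s1pie() as a kvm_has_feat() check on ID_AA64MMFR3_EL1 (a sketch, not the verbatim kernel definition):

/* Sketch only; modelled on the S1PIE helper */
#define kvm_has_s1pie(k)        kvm_has_feat((k), ID_AA64MMFR3_EL1, S1PIE, IMP)
#define kvm_has_s1poe(k)        kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP)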
2024-10-31  KVM: arm64: Subject S1PIE/S1POE registers to HCR_EL2.{TVM,TRVM}  (Marc Zyngier)
All the EL0/EL1 S1PIE/S1POE system registers are caught by the HCR_EL2 TVM and TRVM bits. Reflect this in the coarse-grained trap table. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-30-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Drop bogus CPTR_EL2.E0POE trap routing  (Marc Zyngier)
It took me some time to realise it, but CPTR_EL2.E0POE does not apply to a guest, only to EL0 when InHost(). And when InHost(), CPTR_EL2 is mapped to CPACR_EL1, meaning that the E0POE bit naturally takes effect without any trap. To sum it up, this trap bit is better left ignored, as we will never have to handle it. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-29-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Rely on visibility to let PIR*_ELx/TCR2_ELx UNDEF  (Marc Zyngier)
With a visibility defined for these registers, there is no need to check again for S1PIE or TCRX being implemented as perform_access() already handles it. Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-27-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Hide S1PIE registers from userspace when disabled for guests  (Mark Brown)
When the guest does not support S1PIE we should not allow any access to the system registers it adds in order to ensure that we do not create spurious issues with guest migration. Add a visibility operation for these registers. Fixes: 86f9de9db178 ("KVM: arm64: Save/restore PIE registers") Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-3-376624fa829c@kernel.org [maz: simplify by using __el2_visibility(), kvm_has_s1pie() throughout] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-26-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
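A plausible shape for such a visibility hook, sketched on the assumption that it reuses kvm_has_s1pie() (the function name here is illustrative):

static unsigned int s1pie_visibility(const struct kvm_vcpu *vcpu,
                                     const struct sys_reg_desc *rd)
{
        if (kvm_has_s1pie(vcpu->kvm))
                return 0;               /* register is visible to the guest */

        return REG_HIDDEN;              /* UNDEFs in the guest, hidden from userspace */
}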
2024-10-31  KVM: arm64: Hide TCR2_EL1 from userspace when disabled for guests  (Mark Brown)
When the guest does not support FEAT_TCR2 we should not allow any access to it in order to ensure that we do not create spurious issues with guest migration. Add a visibility operation for it. Fixes: fbff56068232 ("KVM: arm64: Save/restore TCR2_EL1") Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-2-376624fa829c@kernel.org [maz: simplify by using __el2_visibility(), kvm_has_tcr2() throughout] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-25-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Define helper for EL2 registers with custom visibility  (Mark Brown)
In preparation for adding more visibility filtering for EL2 registers add a helper macro like EL2_REG() which allows specification of a custom visibility operation. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-1-376624fa829c@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-24-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
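A sketch of such a macro, modelled on the existing EL2_REG() but with the visibility callback exposed as a parameter (treat the exact macro name and field list as assumptions):

#define EL2_REG_FILTERED(name, acc, rst, v, filter) {   \
        SYS_DESC(SYS_##name),                           \
        .access = acc,                                  \
        .reset = rst,                                   \
        .reg = name,                                    \
        .visibility = filter,                           \
        .val = v,                                       \
}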
2024-10-31  KVM: arm64: Add a composite EL2 visibility helper  (Marc Zyngier)
We are starting to have a bunch of visibility helpers checking for EL2 + something else, and we are going to add more. Simplify things somewhat by introducing a helper that implements exactly that, taking a visibility helper as a parameter, and convert the existing ones to it. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-23-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
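The composite helper plausibly reduces to chaining two checks; a sketch:

static unsigned int __el2_visibility(const struct kvm_vcpu *vcpu,
                                     const struct sys_reg_desc *rd,
                                     unsigned int (*fn)(const struct kvm_vcpu *,
                                                        const struct sys_reg_desc *))
{
        /* Hidden if not NV; otherwise defer to the extra check */
        return el2_visibility(vcpu, rd) ?: fn(vcpu, rd);
}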
2024-10-31  KVM: arm64: Implement AT S1PIE support  (Marc Zyngier)
It doesn't take much effort to implement S1PIE support in AT. It is only a matter of using the AArch64.S1IndirectBasePermissions() encodings for the permissions, ignoring GCS (which has no impact on AT), and enforcing FEAT_PAN3 being enabled, as this is a requirement of FEAT_S1PIE. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241023145345.1613824-22-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
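For context, a sketch of the architectural scheme (not the patch itself): with S1PIE, a block/page descriptor carries a 4-bit permission index instead of direct AP/XN bits, and that index selects one of sixteen 4-bit fields in PIR_ELx (PIRE0_ELx for unprivileged accesses):

/* Sketch: decode an indirect permission from a PIR-style register */
static u8 s1pie_perm(u64 pir, u8 pindex)
{
        return (pir >> (pindex * 4)) & 0xf;     /* 16 entries of 4 bits each */
}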
2024-10-31  KVM: arm64: Disable hierarchical permissions when S1PIE is enabled  (Marc Zyngier)
S1PIE implicitly disables hierarchical permissions, as specified in R_JHSVW, by making TCR_ELx.HPDn RES1. Add a predicate for S1PIE being enabled for a given translation regime, and emulate this behaviour by forcing the hpd field to true if S1PIE is enabled for that translation regime. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241023145345.1613824-21-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
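The emulation then boils down to something like the following sketch, where s1pie_enabled() is the predicate the commit describes and wi is the walk-info structure (both names assumed):

/* R_JHSVW: S1PIE makes TCR_ELx.HPDn behave as RES1 */
wi->hpd |= s1pie_enabled(vcpu, regime);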
2024-10-31  KVM: arm64: Split S1 permission evaluation into direct and hierarchical parts  (Marc Zyngier)
The AArch64.S1DirectBasePermissions() pseudocode deals with both direct and hierarchical S1 permission evaluation. While this is probably convenient in the pseudocode, we would like a bit more flexibility to slot in things like indirect permissions. To that effect, split the two permission check parts out of handle_at_slow() and into their own functions. The permissions are passed around as part of the walk_result structure. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-20-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Add AT fast-path support for S1PIE  (Marc Zyngier)
Emulating AT using AT instructions requires that the live state matches the translation regime the AT instruction targets. If targeting the EL1&0 translation regime and S1PIE is supported, we also need to restore that state (covering TCR2_EL1, PIR_EL1, and PIRE0_EL1). Add the required system register switcheroo. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241023145345.1613824-19-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Handle PIR{,E0}_EL2 traps  (Marc Zyngier)
Add the FEAT_S1PIE EL2 registers to the sysreg descriptor array so that they can be handled as traps. Access to these registers is conditional on ID_AA64MMFR3_EL1.S1PIE being advertised. Similarly to other changes, PIRE0_EL2 is guaranteed to trap thanks to the D22677 update to the architecture. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-17-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Add save/restore for PIR{,E0}_EL2  (Marc Zyngier)
Like their EL1 equivalents, the EL2-specific FEAT_S1PIE registers are context-switched. This is made conditional on both FEAT_TCRX and FEAT_S1PIE being advertised. Note that this change only makes sense if read together with issue D22677 contained in 102105_K.a_04_en. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-16-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Add PIR{,E0}_EL2 to the sysreg arrays  (Marc Zyngier)
Add the FEAT_S1PIE EL2 registers to the per-vcpu sysreg register array. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-15-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Extend masking facility to arbitrary registers  (Marc Zyngier)
We currently only use the masking (RES0/RES1) facility for VNCR registers, as they are memory-based and thus easy to sanitise. But we could apply the same thing to other registers if we:

- split the sanitisation from __VNCR_START__
- apply the sanitisation when reading from a HW register

This involves a new "marker" in the vcpu_sysreg enum, which defines the point at which the sanitisation applies (the VNCR registers being of course after this marker). While we are at it, rename kvm_vcpu_sanitise_vncr_reg() to kvm_vcpu_apply_reg_masks(), which is vaguely more explicit, and harden set_sysreg_masks() against setting masks for random registers... Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241023145345.1613824-10-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
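A sketch of the resulting read-side sanitisation, with the marker splitting maskable registers from the rest (the marker name and the mask-lookup helpers are assumptions):

static inline u64 kvm_vcpu_apply_reg_masks(const struct kvm_vcpu *vcpu,
                                           enum vcpu_sysreg sr, u64 v)
{
        /* Only registers past the marker have RES0/RES1 masks registered */
        if (sr >= __SANITISED_REG_START__) {
                v &= ~res0_mask(vcpu->kvm, sr); /* clear RES0 bits */
                v |= res1_mask(vcpu->kvm, sr);  /* set RES1 bits */
        }

        return v;
}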
2024-10-31  KVM: arm64: Add save/restore for TCR2_EL2  (Marc Zyngier)
Like its EL1 equivalent, TCR2_EL2 gets context-switched. This is made conditional on FEAT_TCRX being advertised. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-14-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Sanitise TCR2_EL2  (Marc Zyngier)
TCR2_EL2 is a bag of control bits, all of which are only valid if certain features are present, and RES0 otherwise. Describe these constraints and register them with the masking infrastructure. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241023145345.1613824-13-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
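Such constraints are typically registered by building a RES0 mask from the advertised feature set; a sketch, with the two feature checks shown purely as examples of the pattern:

u64 res0 = TCR2_EL2_RES0, res1 = TCR2_EL2_RES1;

if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1PIE, IMP))
        res0 |= TCR2_EL2_PIE;           /* PIE is RES0 without FEAT_S1PIE */
if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, HAFDBS, HAFT))
        res0 |= TCR2_EL2_HAFT;          /* HAFT is RES0 without FEAT_HAFT */

set_sysreg_masks(kvm, TCR2_EL2, res0, res1);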
2024-10-31  KVM: arm64: nv: Save/Restore vEL2 sysregs  (Marc Zyngier)
Whenever we need to restore the guest's system registers to the CPU, we now need to take care of the EL2 system registers as well. Most of them are accessed via traps only, but some have an immediate effect, and a guest running in VHE mode would also expect them to be accessible via their EL1 encoding, which we do not trap. For vEL2 we write the virtual EL2 registers with an identical format directly into their EL1 counterparts, and translate the few registers that have a different format, for the same effect on the execution, when running a non-VHE guest hypervisor. Based on an initial patch from Andre Przywara, rewritten many times since. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-8-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Add TCR2_EL2 to the sysreg arrays  (Marc Zyngier)
Add the TCR2_EL2 register to the per-vcpu sysreg register array, the sysreg descriptor array, and advertise it as mapped to TCR2_EL1 for NV purposes. Access to this register is conditional on ID_AA64MMFR3_EL1.TCRX being advertised. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-12-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: nv: Handle CNTHCTL_EL2 specially  (Marc Zyngier)
Accessing CNTHCTL_EL2 is fraught with danger if running with HCR_EL2.E2H=1: half of the bits are held in CNTKCTL_EL1, and thus can be changed behind our back, while the rest lives in the CNTHCTL_EL2 shadow copy that is memory-based. Yes, this is a lot of fun! Make sure that we merge the two on read access, while we can write to CNTKCTL_EL1 in a more straightforward manner. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-7-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
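A sketch of the read side, merging the hardware-held bits with the memory-based shadow (the mask name is an assumption):

/* With HCR_EL2.E2H=1, part of CNTHCTL_EL2 is redirected to CNTKCTL_EL1 */
u64 val = __vcpu_sys_reg(vcpu, CNTHCTL_EL2) & ~CNTKCTL_EL1_BITS;
val |= read_sysreg_el1(SYS_CNTKCTL) & CNTKCTL_EL1_BITS;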
2024-10-31  KVM: arm64: nv: Add missing EL2->EL1 mappings in get_el2_to_el1_mapping()  (Marc Zyngier)
As KVM has grown a bunch of new system registers for NV, it appears that we are missing them in the get_el2_to_el1_mapping() list. Most of them are not crucial as they don't tend to be accessed via vcpu_read_sys_reg() and vcpu_write_sys_reg(). Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-6-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31  KVM: arm64: Drop useless struct s2_mmu in __kvm_at_s1e2()  (Marc Zyngier)
__kvm_at_s1e2() contains the definition of an s2_mmu for the current context, but doesn't make any use of it. Drop it. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241023145345.1613824-5-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-26  KVM: arm64: Don't retire aborted MMIO instruction  (Oliver Upton)
Returning an abort to the guest for an unsupported MMIO access is a documented feature of the KVM UAPI. Nevertheless, it's clear that this plumbing has seen limited testing, since userspace can trivially cause a WARN in the MMIO return:

WARNING: CPU: 0 PID: 30558 at arch/arm64/include/asm/kvm_emulate.h:536 kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
Call trace:
 kvm_handle_mmio_return+0x46c/0x5c4 arch/arm64/include/asm/kvm_emulate.h:536
 kvm_arch_vcpu_ioctl_run+0x98/0x15b4 arch/arm64/kvm/arm.c:1133
 kvm_vcpu_ioctl+0x75c/0xa78 virt/kvm/kvm_main.c:4487
 __do_sys_ioctl fs/ioctl.c:51 [inline]
 __se_sys_ioctl fs/ioctl.c:893 [inline]
 __arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:893
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
 el0_svc_common+0x1e0/0x23c arch/arm64/kernel/syscall.c:132
 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
 el0_svc+0x38/0x68 arch/arm64/kernel/entry-common.c:712
 el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598

The splat is complaining that KVM is advancing PC while an exception is pending, i.e. that KVM is retiring the MMIO instruction despite a pending synchronous external abort. Womp womp. Fix the glaring UAPI bug by skipping over all the MMIO emulation in case there is a pending synchronous exception. Note that while userspace is capable of pending an asynchronous exception (SError, IRQ, or FIQ), it is still safe to retire the MMIO instruction in this case as (1) they are by definition asynchronous, and (2) KVM relies on hardware support for pending/delivering these exceptions instead of the software state machine for advancing PC. Cc: stable@vger.kernel.org Fixes: da345174ceca ("KVM: arm/arm64: Allow user injection of external data aborts") Reported-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241025203106.3529261-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
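The shape of the fix, sketched (the predicate name is illustrative, not the actual helper):

/* In kvm_handle_mmio_return(), before completing the access */
if (kvm_pending_sync_exception(vcpu))
        return 1;       /* don't retire the instruction; deliver the abort */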
2024-10-25  KVM: arm64: Don't mark "struct page" accessed when making SPTE young  (Sean Christopherson)
Don't mark pages/folios as accessed in the primary MMU when making a SPTE young in KVM's secondary MMU, as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and wasteful. KVM participates in page aging via mmu_notifiers, so there's no need to push "accessed" updates to the primary MMU. Dropping use of kvm_set_pfn_accessed() also paves the way for removing kvm_pfn_to_refcounted_page() and all its users. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-84-seanjc@google.com>
2024-10-25  KVM: arm64: Use __gfn_to_page() when copying MTE tags to/from userspace  (Sean Christopherson)
Use __gfn_to_page() instead when copying MTE tags between guest and userspace. This will eventually allow removing gfn_to_pfn_prot(), gfn_to_pfn(), kvm_pfn_to_refcounted_page(), and related APIs. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-78-seanjc@google.com>
2024-10-25  KVM: arm64: Use __kvm_faultin_pfn() to handle memory aborts  (Sean Christopherson)
Convert arm64 to use __kvm_faultin_pfn()+kvm_release_faultin_page(). Three down, six to go. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-57-seanjc@google.com>
2024-10-25  KVM: arm64: Mark "struct page" pfns accessed/dirty before dropping mmu_lock  (Sean Christopherson)
Mark pages/folios accessed+dirty prior to dropping mmu_lock, as marking a page/folio dirty after it has been written back can make some filesystems unhappy (backing KVM guests with such filesystem files is uncommon, and the race is minuscule, hence the lack of complaints). While scary sounding, practically speaking the worst case scenario is that KVM would trigger this WARN in filemap_unaccount_folio():

	/*
	 * At this point folio must be either written or cleaned by
	 * truncate. Dirty folio here signals a bug and loss of
	 * unwritten data - on ordinary filesystems.
	 *
	 * But it's harmless on in-memory filesystems like tmpfs; and can
	 * occur when a driver which did get_user_pages() sets page dirty
	 * before putting it, while the inode is being finally evicted.
	 *
	 * Below fixes dirty accounting after removing the folio entirely
	 * but leaves the dirty flag set: it has no effect for truncated
	 * folio and anyway will be cleared before returning folio to
	 * buddy allocator.
	 */
	if (WARN_ON_ONCE(folio_test_dirty(folio) &&
			 mapping_can_writeback(mapping)))
		folio_account_cleaned(folio, inode_to_wb(mapping->host));

KVM won't actually write memory because the stage-2 mappings are protected by the mmu_notifier, i.e. there is no risk of loss of data, even if the VM were backed by memory that needs writeback. See the link below for additional details. This will also allow converting arm64 to kvm_release_faultin_page(), which requires that mmu_lock be held (for the aforementioned reason). Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-56-seanjc@google.com>
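A sketch of the resulting ordering in the stage-2 fault handler, using the contemporaneous pfn API (treat as illustrative, not the exact diff):

out_unlock:
	if (writable && !ret)
		kvm_set_pfn_dirty(pfn);		/* while mmu_lock is still held */
	kvm_set_pfn_accessed(pfn);
	read_unlock(&kvm->mmu_lock);		/* only now drop the lock */
	kvm_release_pfn_clean(pfn);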
2024-10-25  KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()  (Sean Christopherson)
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-19-seanjc@google.com>
2024-10-25  KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIs  (Sean Christopherson)
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass "false", and remove a comment blurb about KVM running only the "GUP fast" part in atomic context. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-24  KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call  (David Woodhouse)
Pass through the SYSTEM_OFF2 function for hibernation, just like SYSTEM_OFF. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Link: https://lore.kernel.org/r/20241019172459.2241939-6-dwmw2@infradead.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-24  KVM: arm64: Add support for PSCI v1.2 and v1.3  (David Woodhouse)
As with PSCI v1.1 in commit 512865d83fd9 ("KVM: arm64: Bump guest PSCI version to 1.1"), expose v1.3 to the guest by default. The SYSTEM_OFF2 call which is exposed by doing so is compatible for userspace because it's just a new flag in the event that KVM raises, in precisely the same way that SYSTEM_RESET2 was compatible when v1.1 was enabled by default. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Link: https://lore.kernel.org/r/20241019172459.2241939-4-dwmw2@infradead.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-24  KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation  (David Woodhouse)
The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function which is analogous to ACPI S4 state. This will allow hosting environments to determine that a guest is hibernated rather than just powered off, and ensure that they preserve the virtual environment appropriately to allow the guest to resume safely (or bump the hardware_signature in the FACS to trigger a clean reboot instead). This feature is safe to enable unconditionally (in a subsequent commit) because it is exposed to userspace through the existing KVM_SYSTEM_EVENT_SHUTDOWN event, just with an additional flag which userspace can use to know that the instance intended hibernation instead of a plain power-off. As with SYSTEM_RESET2, there is only one type available (in this case HIBERNATE_OFF), and it is not explicitly reported to userspace through the event; userspace can get it from the registers if it cares. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Link: https://lore.kernel.org/r/20241019172459.2241939-3-dwmw2@infradead.org [oliver: slight cleanup of comments] Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
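A sketch of the dispatch, with the helper name and the return-value handling assumed from the existing SYSTEM_OFF path:

case PSCI_1_3_FN64_SYSTEM_OFF2:
	kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN,
				 KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2);
	/* Only observable if userspace ignores the event and resumes */
	val = PSCI_RET_INTERNAL_FAILURE;
	break;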
2024-10-23  KVM: arm64: Don't map 'kvm_vgic_global_state' at EL2 with pKVM  (Will Deacon)
Now that 'kvm_vgic_global_state' is no longer needed for ICC_CTLR_EL1 emulation on machines with a broken SEIS implementation, drop the pKVM hypervisor mapping of the page. Note that kvm_vgic_global_state is still mapped in non-protected hypervisor configurations (i.e. {n,h}VHE) through the rodata section mapping. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241022144016.27350-3-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-23  KVM: arm64: Just advertise SEIS as 0 when emulating ICC_CTLR_EL1  (Will Deacon)
ICC_CTLR_EL1 accesses from a guest are trapped and emulated on systems with broken SEIS support and without FEAT_GICv3_TDIR. On such systems, we mask SEIS support in 'kvm_vgic_global_state.ich_vtr_el2' and so the value of ICC_CTLR_EL1.SEIS visible to the guest is always zero. Simplify the ICC_CTLR_EL1 read emulation to return 0 for the SEIS field, rather than reading an always-zero value from the global state. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241022144016.27350-2-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-17  KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration  (Oliver Upton)
kvm_vgic_map_resources() prematurely marks the distributor as 'ready', potentially allowing vCPUs to enter the guest before the distributor's MMIO registration has been made visible. Plug the race by marking the distributor as ready only after MMIO registration is completed. Rely on the implied ordering of synchronize_srcu() to ensure the MMIO registration is visible before vgic_dist::ready. This also means that writers to vgic_dist::ready are now serialized by the slots_lock, which was effectively the case already as all writers held the slots_lock in addition to the config_lock. Fixes: 59112e9c390b ("KVM: arm64: vgic: Fix a circular locking issue") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241017001947.2707312-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
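The fix boils down to an ordering change under kvm->slots_lock; a sketch (signature abridged):

ret = vgic_register_dist_iodev(kvm, dist_base, type);	/* implies synchronize_srcu() */
if (!ret)
	dist->ready = true;	/* published only after the MMIO device is visible */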
2024-10-17  KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS  (Oliver Upton)
KVM commits to a particular sizing of SPIs when the vgic is initialized, which is before the point a vgic becomes ready. On top of that, KVM supplies a default amount of SPIs should userspace not explicitly configure this. As such, the check for vgic_ready() in the handling of KVM_DEV_ARM_VGIC_GRP_NR_IRQS is completely wrong, and testing if nr_spis is nonzero is sufficient for preventing userspace from playing games with us. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241017001947.2707312-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-17  KVM: arm64: Fix shift-out-of-bounds bug  (Ilkka Koskinen)
Fix a shift-out-of-bounds bug reported by UBSAN when running a VM with an MTE-enabled host kernel.

UBSAN: shift-out-of-bounds in arch/arm64/kvm/sys_regs.c:1988:14
shift exponent 33 is too large for 32-bit type 'int'
CPU: 26 UID: 0 PID: 7629 Comm: qemu-kvm Not tainted 6.12.0-rc2 #34
Hardware name: IEI NF5280R7/Mitchell MB, BIOS 00.00. 2024-10-12 09:28:54 10/14/2024
Call trace:
 dump_backtrace+0xa0/0x128
 show_stack+0x20/0x38
 dump_stack_lvl+0x74/0x90
 dump_stack+0x18/0x28
 __ubsan_handle_shift_out_of_bounds+0xf8/0x1e0
 reset_clidr+0x10c/0x1c8
 kvm_reset_sys_regs+0x50/0x1c8
 kvm_reset_vcpu+0xec/0x2b0
 __kvm_vcpu_set_target+0x84/0x158
 kvm_vcpu_set_target+0x138/0x168
 kvm_arch_vcpu_ioctl_vcpu_init+0x40/0x2b0
 kvm_arch_vcpu_ioctl+0x28c/0x4b8
 kvm_vcpu_ioctl+0x4bc/0x7a8
 __arm64_sys_ioctl+0xb4/0x100
 invoke_syscall+0x70/0x100
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x158
 el0t_64_sync_handler+0x120/0x130
 el0t_64_sync+0x194/0x198

Fixes: 7af0c2534f4c ("KVM: arm64: Normalize cache configuration") Cc: stable@vger.kernel.org Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20241017025701.67936-1-ilkka@os.amperecomputing.com Signed-off-by: Marc Zyngier <maz@kernel.org>
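The class of bug and fix, sketched generically (the actual expression lives in reset_clidr()):

static u64 level_bit(unsigned int shift)
{
	return 1ULL << shift;	/* '1 << shift' is UB once shift >= 32 */
}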
2024-10-17  KVM: arm64: Shave a few bytes from the EL2 idmap code  (Marc Zyngier)
Our idmap is becoming too big, to the point where it doesn't fit in a 4kB page anymore. There are some low-hanging fruits though, such as the el2_init_state horror that is expanded 3 times in the kernel. Let's at least limit ourselves to two copies, which makes the kernel link again. At some point, we'll have to have a better way of doing this. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
2024-10-16  hugetlb: arm64: add mte support  (Yang Shi)
Enable MTE support for hugetlb. The MTE page flags will be set on the folio only. When copying hugetlb folio (for example, CoW), the tags for all subpages will be copied when copying the first subpage. When freeing hugetlb folio, the MTE flags will be cleared. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Link: https://lore.kernel.org/r/20241001225220.271178-1-yang@os.amperecomputing.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-11  KVM: arm64: Don't eagerly teardown the vgic on init error  (Marc Zyngier)
As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time. This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised. Similarly, a thread running vcpu-1 can enter the kernel, and try to init the GIC that was previously created. Since this GIC isn't properly configured (no memory map), it fails to correctly initialise. And that's the point where we decide to teardown the GIC, freeing all its resources. Behind vcpu-0's back. Things stop pretty abruptly, with a variety of symptoms. Clearly, this isn't good, we should be a bit more careful about this. It is obvious that this guest is not viable, as it is missing some important part of its configuration. So instead of trying to tear bits of it down, let's just mark it as *dead*. It means that any further interaction from userspace will result in -EIO. The memory will be released on the "normal" path, when userspace gives up. Cc: stable@vger.kernel.org Reported-by: Alexander Potapenko <glider@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241009183603.3221824-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08  KVM: arm64: Expose S1PIE to guests  (Mark Brown)
Prior to commit 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") we just exposed the sanitised view of ID_AA64MMFR3_EL1 to guests, meaning that they saw both TCRX and S1PIE if present on the host machine. That commit added VMM control over the contents of the register and exposed S1POE but removed S1PIE, meaning that the extension is no longer visible to guests. Re-enable support for S1PIE with VMM control. Fixes: 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241005-kvm-arm64-fix-s1pie-v1-1-5901f02de749@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08  KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule  (Oliver Upton)
There's been a decent amount of attention around unmaps of nested MMUs, and TLBI handling is no exception to this. Add a comment clarifying why it is safe to reschedule during a TLBI unmap, even without a reference on the MMU in progress. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-5-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08  KVM: arm64: nv: Punt stage-2 recycling to a vCPU request  (Oliver Upton)
Currently, when a nested MMU is repurposed for some other MMU context, KVM unmaps everything during vcpu_load() while holding the MMU lock for write. This is quite a performance bottleneck for large nested VMs, as all vCPU scheduling will spin until the unmap completes. Start punting the MMU cleanup to a vCPU request, where it is then possible to periodically release the MMU lock and CPU in the presence of contention. Ensure that no vCPU winds up using a stale MMU by tracking the pending unmap on the S2 MMU itself and requesting an unmap on every vCPU that finds it. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-4-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
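A sketch of the consumer side, with the request name taken from the series and the unmap helper assumed:

if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) {
	/* Unmaps under the MMU lock, periodically yielding via
	 * cond_resched_rwlock_write() when there is contention */
	kvm_nested_s2_unmap_pending(vcpu);
}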
2024-10-08  KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed  (Oliver Upton)
Right now the nested code allows unmap operations on a shadow stage-2 to block unconditionally. This is wrong in a couple places, such as a non-blocking MMU notifier or on the back of a sched_in() notifier as part of shadow MMU recycling. Carry through whether or not blocking is allowed to kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU reclaim would precipitate a stack overflow from a pile of kvm_sched_in() callbacks, all trying to recycle a stage-2 MMU. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08  KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out  (Oliver Upton)
If a vCPU is scheduling out and not in WFI emulation, it is highly likely it will get scheduled again soon and reuse the MMU it had before. Dropping the MMU at vcpu_put() can have some unfortunate consequences, as the MMU could get reclaimed and used in a different context, forcing another 'cold start' on an otherwise active MMU. Avoid that altogether by keeping a reference on the MMU if the vCPU is scheduling out, ensuring that another vCPU cannot reclaim it while the current vCPU is away. Since there are more MMUs than vCPUs, this does not affect the guarantee that an unused MMU is available at any time. Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible code, at least for where it matters in the stage-2 abort path. Yes, the MMU can change across WFI emulation, but there isn't even a use case where this would matter. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
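A sketch of the vcpu_put() side, keyed on the existing IN_WFI vCPU flag (the put helper is an assumption):

/* Keep the reference across an ordinary schedule-out; only drop it
 * when the vCPU is blocking for WFI and may be gone for a while */
if (vcpu_get_flag(vcpu, IN_WFI))
	put_s2_mmu(vcpu);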
2024-10-08  KVM: arm64: Unregister redistributor for failed vCPU creation  (Oliver Upton)
Alex reports that syzkaller has managed to trigger a use-after-free when tearing down a VM:

BUG: KASAN: slab-use-after-free in kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769
Read of size 8 at addr ffffff801c6890d0 by task syz.3.2219/10758
CPU: 3 UID: 0 PID: 10758 Comm: syz.3.2219 Not tainted 6.11.0-rc6-dirty #64
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x17c/0x1a8 arch/arm64/kernel/stacktrace.c:317
 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:324
 __dump_stack lib/dump_stack.c:93 [inline]
 dump_stack_lvl+0x94/0xc0 lib/dump_stack.c:119
 print_report+0x144/0x7a4 mm/kasan/report.c:377
 kasan_report+0xcc/0x128 mm/kasan/report.c:601
 __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381
 kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769
 kvm_vm_release+0x4c/0x60 virt/kvm/kvm_main.c:1409
 __fput+0x198/0x71c fs/file_table.c:422
 ____fput+0x20/0x30 fs/file_table.c:450
 task_work_run+0x1cc/0x23c kernel/task_work.c:228
 do_notify_resume+0x144/0x1a0 include/linux/resume_user_mode.h:50
 el0_svc+0x64/0x68 arch/arm64/kernel/entry-common.c:169
 el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598

Upon closer inspection, it appears that we do not properly tear down the MMIO registration for a vCPU that fails creation late in the game, e.g. a vCPU w/ the same ID already exists in the VM. It is important to consider the context of the commit that introduced this bug by moving the unregistration out of __kvm_vgic_vcpu_destroy(). That change correctly sought to avoid an srcu v. config_lock inversion by breaking up the vCPU teardown into two parts, one guarded by the config_lock. Fix the use-after-free while avoiding lock inversion by adding a special-cased unregistration to __kvm_vgic_vcpu_destroy(). This is safe because failed vCPUs are torn down outside of the config_lock. Cc: stable@vger.kernel.org Fixes: f616506754d3 ("KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors") Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007223909.2157336-1-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08  Merge branch kvm-arm64/idregs-6.12 into kvmarm/fixes  (Marc Zyngier)
* kvm-arm64/idregs-6.12:

Make some fields of ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1 writable from userspace, so that a VMM can influence the set of guest-visible features:

- for ID_AA64DFR0_EL1: DoubleLock, WRPs, PMUVer and DebugVer are writable (courtesy of Shameer Kolothum)
- for ID_AA64PFR1_EL1: BT, SSBS, CVS2_frac are writable (courtesy of Shaoqin Huang)

KVM: selftests: aarch64: Add writable test for ID_AA64PFR1_EL1
KVM: arm64: Allow userspace to change ID_AA64PFR1_EL1
KVM: arm64: Use kvm_has_feat() to check if FEAT_SSBS is advertised to the guest
KVM: arm64: Disable fields that KVM doesn't know how to handle in ID_AA64PFR1_EL1
KVM: arm64: Make the exposed feature bits in AA64DFR0_EL1 writable from userspace

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-01  KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM  (Mark Brown)
When pKVM saves and restores the host floating point state on a SVE system, it programs the vector length in ZCR_EL2.LEN to be whatever the maximum VL for the PE is. But it uses a buffer allocated with kvm_host_sve_max_vl, the maximum VL shared by all PEs in the system. This means that if we run on a system where the maximum VLs are not consistent, we will overflow the buffer on PEs which support larger VLs. Since the host will not currently attempt to make use of non-shared VLs, fix this by explicitly setting the EL2 VL to be the maximum shared VL when we save and restore. This will enforce the limit on host VL usage. Should we wish to support asymmetric VLs, this code will need to be updated along with the required changes for the host: https://lore.kernel.org/r/20240730-kvm-arm64-fix-pkvm-sve-vl-v6-0-cae8a2e0bd66@kernel.org Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM") Signed-off-by: Mark Brown <broonie@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20240912-kvm-arm64-limit-guest-vl-v2-1-dd2c29cb2ac9@kernel.org [maz: added punctuation to the commit message] Signed-off-by: Marc Zyngier <maz@kernel.org>
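A sketch of the resulting clamp (sve_vq_from_vl() is the standard VL-to-vector-quadword conversion; ZCR_ELx.LEN holds VQ minus one):

/* Limit the EL2 vector length to the system-wide shared maximum,
 * matching the save area sized from kvm_host_sve_max_vl */
write_sysreg_s(sve_vq_from_vl(kvm_host_sve_max_vl) - 1, SYS_ZCR_EL2);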
2024-10-01  KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path  (Vincent Donnefort)
On an error, hyp_vcpu will be accessed while this memory has already been relinquished to the host and unmapped from the hypervisor. Protect the CPTR assignment with an early return. Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM") Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20240919110500.2345927-1-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>