summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2022-05-20kvm: x86/pmu: Fix the compare function used by the pmu event filterAaron Lewis
When returning from the compare function the u64 is truncated to an int. This results in a loss of the high nybble[1] in the event select and its sign if that nybble is in use. Switch from using a result that can end up being truncated to a result that can only be: 1, 0, -1. [1] bits 35:32 in the event select register and bits 11:8 in the event select. Fixes: 7ff775aca48ad ("KVM: x86/pmu: Use binary search to check filtered events") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220517051238.2566934-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-17Merge tag 'kvmarm-fixes-5.18-3' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 5.18, take #3 - Correctly expose GICv3 support even if no irqchip is created so that userspace doesn't observe it changing pointlessly (fixing a regression with QEMU) - Don't issue a hypercall to set the id-mapped vectors when protected mode is enabled (fix for pKVM in combination with CPUs affected by Spectre-v3a)
2022-05-15KVM: arm64: Don't hypercall before EL2 initQuentin Perret
Will reported the following splat when running with Protected KVM enabled: [ 2.427181] ------------[ cut here ]------------ [ 2.427668] WARNING: CPU: 3 PID: 1 at arch/arm64/kvm/mmu.c:489 __create_hyp_private_mapping+0x118/0x1ac [ 2.428424] Modules linked in: [ 2.429040] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc2-00084-g8635adc4efc7 #1 [ 2.429589] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2.430286] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2.430734] pc : __create_hyp_private_mapping+0x118/0x1ac [ 2.431091] lr : create_hyp_exec_mappings+0x40/0x80 [ 2.431377] sp : ffff80000803baf0 [ 2.431597] x29: ffff80000803bb00 x28: 0000000000000000 x27: 0000000000000000 [ 2.432156] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 2.432561] x23: ffffcd96c343b000 x22: 0000000000000000 x21: ffff80000803bb40 [ 2.433004] x20: 0000000000000004 x19: 0000000000001800 x18: 0000000000000000 [ 2.433343] x17: 0003e68cf7efdd70 x16: 0000000000000004 x15: fffffc81f602a2c8 [ 2.434053] x14: ffffdf8380000000 x13: ffffcd9573200000 x12: ffffcd96c343b000 [ 2.434401] x11: 0000000000000004 x10: ffffcd96c1738000 x9 : 0000000000000004 [ 2.434812] x8 : ffff80000803bb40 x7 : 7f7f7f7f7f7f7f7f x6 : 544f422effff306b [ 2.435136] x5 : 000000008020001e x4 : ffff207d80a88c00 x3 : 0000000000000005 [ 2.435480] x2 : 0000000000001800 x1 : 000000014f4ab800 x0 : 000000000badca11 [ 2.436149] Call trace: [ 2.436600] __create_hyp_private_mapping+0x118/0x1ac [ 2.437576] create_hyp_exec_mappings+0x40/0x80 [ 2.438180] kvm_init_vector_slots+0x180/0x194 [ 2.458941] kvm_arch_init+0x80/0x274 [ 2.459220] kvm_init+0x48/0x354 [ 2.459416] arm_init+0x20/0x2c [ 2.459601] do_one_initcall+0xbc/0x238 [ 2.459809] do_initcall_level+0x94/0xb4 [ 2.460043] do_initcalls+0x54/0x94 [ 2.460228] do_basic_setup+0x1c/0x28 [ 2.460407] kernel_init_freeable+0x110/0x178 [ 2.460610] kernel_init+0x20/0x1a0 [ 2.460817] ret_from_fork+0x10/0x20 [ 2.461274] ---[ end trace 0000000000000000 ]--- Indeed, the Protected KVM mode promotes __create_hyp_private_mapping() to a hypercall as EL1 no longer has access to the hypervisor's stage-1 page-table. However, the call from kvm_init_vector_slots() happens after pKVM has been initialized on the primary CPU, but before it has been initialized on secondaries. As such, if the KVM initcall procedure is migrated from one CPU to another in this window, the hypercall may end up running on a CPU for which EL2 has not been initialized. Fortunately, the pKVM hypervisor doesn't rely on the host to re-map the vectors in the private range, so the hypercall in question is in fact superfluous. Skip it when pKVM is enabled. Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> [maz: simplified the checks slightly] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220513092607.35233-1-qperret@google.com
2022-05-15KVM: arm64: vgic-v3: Consistently populate ID_AA64PFR0_EL1.GICMarc Zyngier
When adding support for the slightly wonky Apple M1, we had to populate ID_AA64PFR0_EL1.GIC==1 to present something to the guest, as the HW itself doesn't advertise the feature. However, we gated this on the in-kernel irqchip being created. This causes some trouble for QEMU, which snapshots the state of the registers before creating a virtual GIC, and then tries to restore these registers once the GIC has been created. Obviously, between the two stages, ID_AA64PFR0_EL1.GIC has changed value, and the write fails. The fix is to actually emulate the HW, and always populate the field if the HW is capable of it. Fixes: 562e530fd770 ("KVM: arm64: Force ID_AA64PFR0_EL1.GIC=1 when exposing a virtual GICv3") Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Reported-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Oliver Upton <oupton@google.com> Link: https://lore.kernel.org/r/20220503211424.3375263-1-maz@kernel.org
2022-05-12KVM: x86/mmu: Update number of zapped pages even if page list is stableSean Christopherson
When zapping obsolete pages, update the running count of zapped pages regardless of whether or not the list has become unstable due to zapping a shadow page with its own child shadow pages. If the VM is backed by mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping the batch count and thus without yielding. In the worst case scenario, this can cause a soft lokcup. watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020] RIP: 0010:workingset_activation+0x19/0x130 mark_page_accessed+0x266/0x2e0 kvm_set_pfn_accessed+0x31/0x40 mmu_spte_clear_track_bits+0x136/0x1c0 drop_spte+0x1a/0xc0 mmu_page_zap_pte+0xef/0x120 __kvm_mmu_prepare_zap_page+0x205/0x5e0 kvm_mmu_zap_all_fast+0xd7/0x190 kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10 kvm_page_track_flush_slot+0x5c/0x80 kvm_arch_flush_shadow_memslot+0xe/0x10 kvm_set_memslot+0x1a8/0x5d0 __kvm_set_memory_region+0x337/0x590 kvm_vm_ioctl+0xb08/0x1040 Fixes: fbb158cb88b6 ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""") Reported-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220511145122.3133334-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-06KVM: VMX: Exit to userspace if vCPU has injected exception and invalid stateSean Christopherson
Exit to userspace with an emulation error if KVM encounters an injected exception with invalid guest state, in addition to the existing check of bailing if there's a pending exception (KVM doesn't support emulating exceptions except when emulating real mode via vm86). In theory, KVM should never get to such a situation as KVM is supposed to exit to userspace before injecting an exception with invalid guest state. But in practice, userspace can intervene and manually inject an exception and/or stuff registers to force invalid guest state while a previously injected exception is awaiting reinjection. Fixes: fc4fad79fc3d ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception") Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220502221850.131873-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-06KVM: SEV: Mark nested locking of vcpu->lockPeter Gonda
svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all source and target vcpu->locks. Unfortunately there is an 8 subclass limit, so a new subclass cannot be used for each vCPU. Instead maintain ownership of the first vcpu's mutex.dep_map using a role specific subclass: source vs target. Release the other vcpu's mutex.dep_maps. Fixes: b56639318bb2b ("KVM: SEV: Add support for SEV intra host migration") Reported-by: John Sperbeck<jsperbeck@google.com> Suggested-by: David Rientjes <rientjes@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Gonda <pgonda@google.com> Message-Id: <20220502165807.529624-1-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03Merge branch 'kvm-amd-pmu-fixes' into HEADPaolo Bonzini
2022-05-03kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMUSandipan Das
On some x86 processors, CPUID leaf 0xA provides information on Architectural Performance Monitoring features. It advertises a PMU version which Qemu uses to determine the availability of additional MSRs to manage the PMCs. Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for the same, the kernel constructs return values based on the x86_pmu_capability irrespective of the vendor. This leaf and the additional MSRs are not supported on AMD and Hygon processors. If AMD PerfMonV2 is detected, the PMU version is set to 2 and guest startup breaks because of an attempt to access a non-existent MSR. Return zeros to avoid this. Fixes: a6c06ed1a60a ("KVM: Expose the architectural performance monitoring CPUID leaf") Reported-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_idKyle Huey
Zen renumbered some of the performance counters that correspond to the well known events in perf_hw_id. This code in KVM was never updated for that, so guest that attempt to use counters on Zen that correspond to the pre-Zen perf_hw_id values will silently receive the wrong values. This has been observed in the wild with rr[0] when running in Zen 3 guests. rr uses the retired conditional branch counter 00d1 which is incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND. [0] https://rr-project.org/ Signed-off-by: Kyle Huey <me@kylehuey.com> Message-Id: <20220503050136.86298-1-khuey@kylehuey.com> Cc: stable@vger.kernel.org [Check guest family, not host. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEADPaolo Bonzini
We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock is held for write, as volatile SPTEs can be written by other tasks/vCPUs outside of mmu_lock. Attempting to prove that bug exposed another notable goof, which has been lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs as volatile, even though KVM never clears WRITABLE outside of MMU lock. As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to update writable SPTEs. The fix does not seem to have an easily-measurable affect on performance; page faults are so slow that wasting even a few hundred cycles is dwarfed by the base cost.
2022-05-03KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bitsSean Christopherson
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even if mmu_lock is held for write, as volatile SPTEs can be written by other tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE to write a page, the CPU can cache the translation as WRITABLE in the TLB despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the Accessed/Dirty bits and not properly tag the backing page. Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and KVM doesn't consume the Accessed bit of non-leaf SPTEs. Dropping the Dirty and/or Writable bits is most problematic for dirty logging, as doing so can result in a missed TLB flush and eventually a missed dirty page. In the unlikely event that the only dirty page(s) is a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty (based on the Dirty or Writable bit depending on the method) and so not update the SPTE and ultimately not flush. If the SPTE is cached in the TLB as writable before it is clobbered, the guest can continue writing the associated page without ever taking a write-protect fault. For most (all?) file back memory, dropping the Dirty bit is a non-issue. The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty bit is effectively ignored because the primary MMU will mark that page dirty when the write-protection is lifted, e.g. when KVM faults the page back in for write. The Accessed bit is a complete non-issue. Aside from being unused for non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the Accessed bit may be dropped anyways. Lastly, the Writable bit is also problematic as an extension of the Dirty bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is !DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE out of mmu_lock, then it can allow the CPU to set the Dirty bit despite the SPTE being !WRITABLE when it is checked by KVM. But that all depends on the Dirty bit being problematic in the first place. Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Venkatesh Srinivas <venkateshs@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()Sean Christopherson
Move the is_shadow_present_pte() check out of spte_has_volatile_bits() and into its callers. Well, caller, since only one of its two callers doesn't already do the shadow-present check. Opportunistically move the helper to spte.c/h so that it can be used by the TDP MMU, which is also the primary motivation for the shadow-present change. Unlike the legacy MMU, the TDP MMU uses a single path for clear leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP MMU will need to check is_last_spte() prior to calling spte_has_volatile_bits(), and calling is_last_spte() without first calling is_shadow_present_spte() is at best odd, and at worst a violation of KVM's loosely defines SPTE rules. Note, mmu_spte_clear_track_bits() could likely skip the write entirely for SPTEs that are not shadow-present. Leave that cleanup for a future patch to avoid introducing a functional change, and because the shadow-present check can likely be moved further up the stack, e.g. drop_large_spte() appears to be the only path that doesn't already explicitly check for a shadow-present SPTE. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)Sean Christopherson
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as being volatile (unless they're volatile for other reasons, e.g. A/D bits). KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get cleared just because the SPTE is MMU-writable. Rename the wrapper of MMU-writable to be more literal, the previous name of spte_can_locklessly_be_made_writable() is wrong and misleading. Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86: work around QEMU issue with synthetic CPUID leavesPaolo Bonzini
Synthesizing AMD leaves up to 0x80000021 caused problems with QEMU, which assumes the *host* CPUID[0x80000000].EAX is higher or equal to what KVM_GET_SUPPORTED_CPUID reports. This causes QEMU to issue bogus host CPUIDs when preparing the input to KVM_SET_CPUID2. It can even get into an infinite loop, which is only terminated by an abort(): cpuid_data is full, no space for cpuid(eax:0x8000001d,ecx:0x3e) To work around this, only synthesize those leaves if 0x8000001d exists on the host. The synthetic 0x80000021 leaf is mostly useful on Zen2, which satisfies the condition. Fixes: f144c49e8c39 ("KVM: x86: synthesize CPUID leaf 0x80000021h if useful") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29Revert "x86/mm: Introduce lookup_address_in_mm()"Sean Christopherson
Drop lookup_address_in_mm() now that KVM is providing it's own variant of lookup_address_in_pgd() that is safe for use with user addresses, e.g. guards against page tables being torn down. A variant that provides a non-init mm is inherently dangerous and flawed, as the only reason to use an mm other than init_mm is to walk a userspace mapping, and lookup_address_in_pgd() does not play nice with userspace mappings, e.g. doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE() to ensure an upper level entry isn't converted to a huge page between checking the PAGE_SIZE bit and grabbing the address of the next level down. This reverts commit 13c72c060f1ba6f4eddd7b1c4f52a8aded43d6d9. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <YmwIi3bXr/1yhYV/@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29Merge branch 'kvm-fixes-for-5.18-rc5' into HEADPaolo Bonzini
Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees: * Fix potential races when walking host page table * Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT * Fix shadow page table leak when KVM runs nested
2022-04-29KVM: x86/mmu: fix potential races when walking host page tableMingwei Zhang
KVM uses lookup_address_in_mm() to detect the hugepage size that the host uses to map a pfn. The function suffers from several issues: - no usage of READ_ONCE(*). This allows multiple dereference of the same page table entry. The TOCTOU problem because of that may cause KVM to incorrectly treat a newly generated leaf entry as a nonleaf one, and dereference the content by using its pfn value. - the information returned does not match what KVM needs; for non-present entries it returns the level at which the walk was terminated, as long as the entry is not 'none'. KVM needs level information of only 'present' entries, otherwise it may regard a non-present PXE entry as a present large page mapping. - the function is not safe for mappings that can be torn down, because it does not disable IRQs and because it returns a PTE pointer which is never safe to dereference after the function returns. So implement the logic for walking host page tables directly in KVM, and stop using lookup_address_in_mm(). Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220429031757.2042406-1-mizhang@google.com> [Inline in host_pfn_mapping_level, ensure no semantic change for its callers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENTPaolo Bonzini
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags member that at the time was unused. Unfortunately this extensibility mechanism has several issues: - x86 is not writing the member, so it would not be possible to use it on x86 except for new events - the member is not aligned to 64 bits, so the definition of the uAPI struct is incorrect for 32- on 64-bit userspace. This is a problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately usage of flags was only introduced in 5.18. Since padding has to be introduced, place a new field in there that tells if the flags field is valid. To allow further extensibility, in fact, change flags to an array of 16 values, and store how many of the values are valid. The availability of the new ndata field is tied to a system capability; all architectures are changed to fill in the field. To avoid breaking compilation of userspace that was using the flags field, provide a userspace-only union to overlap flags with data[0]. The new field is placed at the same offset for both 32- and 64-bit userspace. Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Message-Id: <20220422103013.34832-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDRSean Christopherson
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29Merge tag 'kvmarm-fixes-5.18-2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 5.18, take #2 - Take care of faults occuring between the PARange and IPA range by injecting an exception - Fix S2 faults taken from a host EL0 in protected mode - Work around Oops caused by a PMU access from a 32bit guest when PMU has been created. This is a temporary bodge until we fix it for good.
2022-04-27KVM: arm64: Inject exception on out-of-IPA-range translation faultMarc Zyngier
When taking a translation fault for an IPA that is outside of the range defined by the hypervisor (between the HW PARange and the IPA range), we stupidly treat it as an IO and forward the access to userspace. Of course, userspace can't do much with it, and things end badly. Arguably, the guest is braindead, but we should at least catch the case and inject an exception. Check the faulting IPA against: - the sanitised PARange: inject an address size fault - the IPA size: inject an abort Reported-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-04-27KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not setAlexandru Elisei
kvm->arch.arm_pmu is set when userspace attempts to set the first PMU attribute. As certain attributes are mandatory, arm_pmu ends up always being set to a valid arm_pmu, otherwise KVM will refuse to run the VCPU. However, this only happens if the VCPU has the PMU feature. If the VCPU doesn't have the feature bit set, kvm->arch.arm_pmu will be left uninitialized and equal to NULL. KVM doesn't do ID register emulation for 32-bit guests and accesses to the PMU registers aren't gated by the pmu_visibility() function. This is done to prevent injecting unexpected undefined exceptions in guests which have detected the presence of a hardware PMU. But even though the VCPU feature is missing, KVM still attempts to emulate certain aspects of the PMU when PMU registers are accessed. This leads to a NULL pointer dereference like this one, which happens on an odroid-c4 board when running the kvm-unit-tests pmu-cycle-counter test with kvmtool and without the PMU feature being set: [ 454.402699] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000150 [ 454.405865] Mem abort info: [ 454.408596] ESR = 0x96000004 [ 454.411638] EC = 0x25: DABT (current EL), IL = 32 bits [ 454.416901] SET = 0, FnV = 0 [ 454.419909] EA = 0, S1PTW = 0 [ 454.423010] FSC = 0x04: level 0 translation fault [ 454.427841] Data abort info: [ 454.430687] ISV = 0, ISS = 0x00000004 [ 454.434484] CM = 0, WnR = 0 [ 454.437404] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000c924000 [ 454.443800] [0000000000000150] pgd=0000000000000000, p4d=0000000000000000 [ 454.450528] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 454.456036] Modules linked in: [ 454.459053] CPU: 1 PID: 267 Comm: kvm-vcpu-0 Not tainted 5.18.0-rc4 #113 [ 454.465697] Hardware name: Hardkernel ODROID-C4 (DT) [ 454.470612] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 454.477512] pc : kvm_pmu_event_mask.isra.0+0x14/0x74 [ 454.482427] lr : kvm_pmu_set_counter_event_type+0x2c/0x80 [ 454.487775] sp : ffff80000a9839c0 [ 454.491050] x29: ffff80000a9839c0 x28: ffff000000a83a00 x27: 0000000000000000 [ 454.498127] x26: 0000000000000000 x25: 0000000000000000 x24: ffff00000a510000 [ 454.505198] x23: ffff000000a83a00 x22: ffff000003b01000 x21: 0000000000000000 [ 454.512271] x20: 000000000000001f x19: 00000000000003ff x18: 0000000000000000 [ 454.519343] x17: 000000008003fe98 x16: 0000000000000000 x15: 0000000000000000 [ 454.526416] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 454.533489] x11: 000000008003fdbc x10: 0000000000009d20 x9 : 000000000000001b [ 454.540561] x8 : 0000000000000000 x7 : 0000000000000d00 x6 : 0000000000009d00 [ 454.547633] x5 : 0000000000000037 x4 : 0000000000009d00 x3 : 0d09000000000000 [ 454.554705] x2 : 000000000000001f x1 : 0000000000000000 x0 : 0000000000000000 [ 454.561779] Call trace: [ 454.564191] kvm_pmu_event_mask.isra.0+0x14/0x74 [ 454.568764] kvm_pmu_set_counter_event_type+0x2c/0x80 [ 454.573766] access_pmu_evtyper+0x128/0x170 [ 454.577905] perform_access+0x34/0x80 [ 454.581527] kvm_handle_cp_32+0x13c/0x160 [ 454.585495] kvm_handle_cp15_32+0x1c/0x30 [ 454.589462] handle_exit+0x70/0x180 [ 454.592912] kvm_arch_vcpu_ioctl_run+0x1c4/0x5e0 [ 454.597485] kvm_vcpu_ioctl+0x23c/0x940 [ 454.601280] __arm64_sys_ioctl+0xa8/0xf0 [ 454.605160] invoke_syscall+0x48/0x114 [ 454.608869] el0_svc_common.constprop.0+0xd4/0xfc [ 454.613527] do_el0_svc+0x28/0x90 [ 454.616803] el0_svc+0x34/0xb0 [ 454.619822] el0t_64_sync_handler+0xa4/0x130 [ 454.624049] el0t_64_sync+0x18c/0x190 [ 454.627675] Code: a9be7bfd 910003fd f9000bf3 52807ff3 (b9415001) [ 454.633714] ---[ end trace 0000000000000000 ]--- In this particular case, Linux hasn't detected the presence of a hardware PMU because the PMU node is missing from the DTB, so userspace would have been unable to set the VCPU PMU feature even if it attempted it. What happens is that the 32-bit guest reads ID_DFR0, which advertises the presence of the PMU, and when it tries to program a counter, it triggers the NULL pointer dereference because kvm->arch.arm_pmu is NULL. kvm-arch.arm_pmu was introduced by commit 46b187821472 ("KVM: arm64: Keep a per-VM pointer to the default PMU"). Until that commit, this error would be triggered instead: [ 73.388140] ------------[ cut here ]------------ [ 73.388189] Unknown PMU version 0 [ 73.390420] WARNING: CPU: 1 PID: 264 at arch/arm64/kvm/pmu-emul.c:36 kvm_pmu_event_mask.isra.0+0x6c/0x74 [ 73.399821] Modules linked in: [ 73.402835] CPU: 1 PID: 264 Comm: kvm-vcpu-0 Not tainted 5.17.0 #114 [ 73.409132] Hardware name: Hardkernel ODROID-C4 (DT) [ 73.414048] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 73.420948] pc : kvm_pmu_event_mask.isra.0+0x6c/0x74 [ 73.425863] lr : kvm_pmu_event_mask.isra.0+0x6c/0x74 [ 73.430779] sp : ffff80000a8db9b0 [ 73.434055] x29: ffff80000a8db9b0 x28: ffff000000dbaac0 x27: 0000000000000000 [ 73.441131] x26: ffff000000dbaac0 x25: 00000000c600000d x24: 0000000000180720 [ 73.448203] x23: ffff800009ffbe10 x22: ffff00000b612000 x21: 0000000000000000 [ 73.455276] x20: 000000000000001f x19: 0000000000000000 x18: ffffffffffffffff [ 73.462348] x17: 000000008003fe98 x16: 0000000000000000 x15: 0720072007200720 [ 73.469420] x14: 0720072007200720 x13: ffff800009d32488 x12: 00000000000004e6 [ 73.476493] x11: 00000000000001a2 x10: ffff800009d32488 x9 : ffff800009d32488 [ 73.483565] x8 : 00000000ffffefff x7 : ffff800009d8a488 x6 : ffff800009d8a488 [ 73.490638] x5 : ffff0000f461a9d8 x4 : 0000000000000000 x3 : 0000000000000001 [ 73.497710] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000000dbaac0 [ 73.504784] Call trace: [ 73.507195] kvm_pmu_event_mask.isra.0+0x6c/0x74 [ 73.511768] kvm_pmu_set_counter_event_type+0x2c/0x80 [ 73.516770] access_pmu_evtyper+0x128/0x16c [ 73.520910] perform_access+0x34/0x80 [ 73.524532] kvm_handle_cp_32+0x13c/0x160 [ 73.528500] kvm_handle_cp15_32+0x1c/0x30 [ 73.532467] handle_exit+0x70/0x180 [ 73.535917] kvm_arch_vcpu_ioctl_run+0x20c/0x6e0 [ 73.540489] kvm_vcpu_ioctl+0x2b8/0x9e0 [ 73.544283] __arm64_sys_ioctl+0xa8/0xf0 [ 73.548165] invoke_syscall+0x48/0x114 [ 73.551874] el0_svc_common.constprop.0+0xd4/0xfc [ 73.556531] do_el0_svc+0x28/0x90 [ 73.559808] el0_svc+0x28/0x80 [ 73.562826] el0t_64_sync_handler+0xa4/0x130 [ 73.567054] el0t_64_sync+0x1a0/0x1a4 [ 73.570676] ---[ end trace 0000000000000000 ]--- [ 73.575382] kvm: pmu event creation failed -2 The root cause remains the same: kvm->arch.pmuver was never set to something sensible because the VCPU feature itself was never set. The odroid-c4 is somewhat of a special case, because Linux doesn't probe the PMU. But the above errors can easily be reproduced on any hardware, with or without a PMU driver, as long as userspace doesn't set the PMU feature. Work around the fact that KVM advertises a PMU even when the VCPU feature is not set by gating all PMU emulation on the feature. The guest can still access the registers without KVM injecting an undefined exception. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220425145530.723858-1-alexandru.elisei@arm.com
2022-04-27KVM: arm64: Handle host stage-2 faults from 32-bit EL0Will Deacon
When pKVM is enabled, host memory accesses are translated by an identity mapping at stage-2, which is populated lazily in response to synchronous exceptions from 64-bit EL1 and EL0. Extend this handling to cover exceptions originating from 32-bit EL0 as well. Although these are very unlikely to occur in practice, as the kernel typically ensures that user pages are initialised before mapping them in, drivers could still map previously untouched device pages into userspace and expect things to work rather than panic the system. Cc: Quentin Perret <qperret@google.com> Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220427171332.13635-1-will@kernel.org
2022-04-21KVM: SEV: add cache flush to solve SEV cache incoherency issuesMingwei Zhang
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by: Sean Christpherson <seanjc@google.com> Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUsMingwei Zhang
Use clflush_cache_range() to flush the confidential memory when SME_COHERENT is supported in AMD CPU. Cache flush is still needed since SME_COHERENT only support cache invalidation at CPU side. All confidential cache lines are still incoherent with DMA devices. Cc: stable@vger.kerel.org Fixes: add5e2f04541 ("KVM: SVM: Add support for the SEV-ES VMSA") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-3-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: SVM: Simplify and harden helper to flush SEV guest page(s)Sean Christopherson
Rework sev_flush_guest_memory() to explicitly handle only a single page, and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page flushing is currently used only to flush the VMSA, and in its current form, the helper is completely broken with respect to flushing actual guest memory, i.e. won't work correctly for an arbitrary memory range. VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page walks, i.e. will fault if the address is not present in the host page tables or does not have the correct permissions. Current AMD CPUs also do not honor SMAP overrides (undocumented in kernel versions of the APM), so passing in a userspace address is completely out of the question. In other words, KVM would need to manually walk the host page tables to get the pfn, ensure the pfn is stable, and then use the direct map to invoke VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is particularly evil/clever and backs the guest with Secret Memory (which unmaps memory from the direct map). Signed-off-by: Sean Christopherson <seanjc@google.com> Fixes: add5e2f04541 ("KVM: SVM: Add support for the SEV-ES VMSA") Reported-by: Mingwei Zhang <mizhang@google.com> Cc: stable@vger.kernel.org Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-2-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdogLike Xu
NMI-watchdog is one of the favorite features of kernel developers, but it does not work in AMD guest even with vPMU enabled and worse, the system misrepresents this capability via /proc. This is a PMC emulation error. KVM does not pass the latest valid value to perf_event in time when guest NMI-watchdog is running, thus the perf_event corresponding to the watchdog counter will enter the old state at some point after the first guest NMI injection, forcing the hardware register PMC0 to be constantly written to 0x800000000001. Meanwhile, the running counter should accurately reflect its new value based on the latest coordinated pmc->counter (from vPMC's point of view) rather than the value written directly by the guest. Fixes: 168d918f2643 ("KVM: x86: Adjust counter sample period after a wrmsr") Reported-by: Dongli Cao <caodongli@kingsoft.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Yanan Wang <wangyanan55@huawei.com> Tested-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220409015226.38619-1-likexu@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resumeWanpeng Li
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to host-side polling after suspend/resume. Non-bootstrap CPUs are restored correctly by the haltpoll driver because they are hot-unplugged during suspend and hot-plugged during resume; however, the BSP is not hotpluggable and remains in host-sde polling mode after the guest resume. The makes the guest pay for the cost of vmexits every time the guest enters idle. Fix it by recording BSP's haltpoll state and resuming it during guest resume. Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabledSean Christopherson
Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is disabled at the module level to avoid having to acquire the mutex and potentially process all vCPUs. The DISABLE inhibit will (barring bugs) never be lifted, so piling on more inhibits is unnecessary. Fixes: cae72dcc3b21 ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a raceSean Christopherson
Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an in-kernel local APIC and APICv enabled at the module level. Consuming kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens before the vCPU is fully onlined, i.e. it won't get the request made to "all" vCPUs. If APICv is globally inhibited between setting apicv_active and onlining the vCPU, the vCPU will end up running with APICv enabled and trigger KVM's sanity check. Mark APICv as active during vCPU creation if APICv is enabled at the module level, both to be optimistic about it's final state, e.g. to avoid additional VMWRITEs on VMX, and because there are likely bugs lurking since KVM checks apicv_active in multiple vCPU creation paths. While keeping the current behavior of consuming kvm_apicv_activated() is arguably safer from a regression perspective, force apicv_active so that vCPU creation runs with deterministic state and so that if there are bugs, they are found sooner than later, i.e. not when some crazy race condition is hit. WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877 Modules linked in: CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014 RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877 Call Trace: <TASK> vcpu_run arch/x86/kvm/x86.c:10039 [inline] kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234 kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a call to KVM_SET_GUEST_DEBUG. r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0) r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0) ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async) r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async) r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002) ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5}) ioctl$KVM_RUN(r2, 0xae80, 0x0) Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Fixes: 8df14af42f00 ("kvm: x86: Add support for dynamic APICv activation") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: nVMX: Defer APICv updates while L2 is active until L1 is activeSean Christopherson
Defer APICv updates that occur while L2 is active until nested VM-Exit, i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1 is active and (a) stomps all over vmcs02 and (b) neglects to ever updated vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv becomes unhibited while L2 is active, KVM will set various APICv controls in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless running with nested_early_check=1, KVM blames L1 and chaos ensues. In all cases, ignoring vmcs02 and always deferring the inhibition change to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE inhibitions cannot truly change while L2 is active (see below). IRQ_BLOCKING can change, but it is firmly a best effort debug feature. Furthermore, only L2's APIC is accelerated/virtualized to the full extent possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR interception will apply to the virtual APIC managed by KVM. The exception is the SELF_IPI register when x2APIC is enabled, but that's an acceptable hole. Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the MSRs to L2, but for that to work in any sane capacity, L1 would need to pass through IRQs to L2 as well, and IRQs must be intercepted to enable virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling VID for L2 are, for all intents and purposes, mutually exclusive. Lack of dynamic toggling is also why this scenario is all but impossible to encounter in KVM's current form. But a future patch will pend an APICv update request _during_ vCPU creation to plug a race where a vCPU that's being created doesn't get included in the "all vCPUs request" because it's not yet visible to other vCPUs. If userspaces restores L2 after VM creation (hello, KVM selftests), the first KVM_RUN will occur while L2 is active and thus service the APICv update request made during VM creation. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420013732.3308816-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabledSean Christopherson
Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via module param. A recent refactoring to add a wrapper for setting/clearing inhibits unintentionally changed the flag, probably due to a copy+paste goof. Fixes: 4f4c4a3ee53c ("KVM: x86: Trace all APICv inhibit changes and capture overall status") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abusedSean Christopherson
Add wrappers to acquire/release KVM's SRCU lock when stashing the index in vcpu->src_idx, along with rudimentary detection of illegal usage, e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the SRCU index is (currently) either 0 or 1, illegal nesting bugs can go unnoticed for quite some time and only cause problems when the nested lock happens to get a different index. Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will likely yell so loudly that it will bring the kernel to its knees. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Fabiano Rosas <farosas@linux.ibm.com> Message-Id: <20220415004343.2203171-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copySean Christopherson
Use the generic kvm_vcpu's srcu_idx instead of using an indentical field in RISC-V's version of kvm_vcpu_arch. Generic KVM very intentionally does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul of common code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220415004343.2203171-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()Sean Christopherson
Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the lock in kvm_arch_vcpu_ioctl_run(). More importantly, don't overwrite vcpu->srcu_idx. If the index acquired by complete_emulated_io() differs from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively leak a lock and hang if/when synchronize_srcu() is invoked for the relevant grace period. Fixes: 8d25b7beca7e ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220415004343.2203171-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-20RISC-V: KVM: Restrict the extensions that can be disabledAtish Patra
Currently, the config isa register allows us to disable all allowed single letter ISA extensions. It shouldn't be the case as vmm shouldn't be able to disable base extensions (imac). These extensions should always be enabled as long as they are enabled in the host ISA. Signed-off-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org> Fixes: 92ad82002c39 ("RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls")
2022-04-20RISC-V: KVM: Remove 's' & 'u' as valid ISA extensionAtish Patra
There are no ISA extension defined as 's' & 'u' in RISC-V specifications. The misa register defines 's' & 'u' bit as Supervisor/User privilege mode enabled. But it should not appear in the ISA extension in the device tree. Remove those from the allowed ISA extension for kvm. Fixes: a33c72faf2d7 ("RISC-V: KVM: Implement VCPU create, init and destroy functions") Signed-off-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-04-17Merge tag 'x86-urgent-2022-04-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Two x86 fixes related to TSX: - Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX to cover all CPUs which allow to disable it. - Disable TSX development mode at boot so that a microcode update which provides TSX development mode does not suddenly make the system vulnerable to TSX Asynchronous Abort" * tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tsx: Disable TSX development mode at boot x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
2022-04-16Merge tag 'soc-fixes-5.18-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "There are a number of SoC bugfixes that came in since the merge window, and more of them are already pending. This batch includes: - A boot time regression fix for davinci that triggered on multi_v5_defconfig when booting any platform - Defconfig updates to address removed features, changed symbol names or dependencies, for gemini, ux500, and pxa - Email address changes for Krzysztof Kozlowski - Build warning fixes for ep93xx and iop32x - Devicetree warning fixes across many platforms - Minor bugfixes for the reset controller, memory controller and SCMI firmware subsystems plus the versatile-express board" * tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (34 commits) ARM: config: Update Gemini defconfig arm64: dts: qcom/sdm845-shift-axolotl: Fix boolean properties with values ARM: dts: align SPI NOR node name with dtschema ARM: dts: Fix more boolean properties with values arm/arm64: dts: qcom: Fix boolean properties with values arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes arm: dts: imx: Fix boolean properties with values arm64: dts: tegra: Fix boolean properties with values arm: dts: at91: Fix boolean properties with values arm: configs: imote2: Drop defconfig as board support dropped. ep93xx: clock: Don't use plain integer as NULL pointer ep93xx: clock: Fix UAF in ep93xx_clk_register_gate() ARM: vexpress/spc: Fix all the kernel-doc build warnings ARM: vexpress/spc: Fix kernel-doc build warning for ve_spc_cpu_in_wfi ARM: config: u8500: Re-enable AB8500 battery charging ARM: config: u8500: Add some common hardware memory: fsl_ifc: populate child nodes of buses and mfd devices ARM: config: Refresh U8500 defconfig firmware: arm_scmi: Fix sparse warnings in OPTEE transport driver firmware: arm_scmi: Replace zero-length array with flexible-array member ...
2022-04-15mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcoreOmar Sandoval
Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge the vmap areas instead of doing it lazily. Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context") moved the purging from the vunmap() caller to a worker thread. Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin (possibly forever). For example, consider the following scenario: 1. Thread reads from /proc/vmcore. This eventually calls __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1. 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2 pages (one page plus the guard page) to the purge list and vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the drain_vmap_work is scheduled. 3. Thread returns from the kernel and is scheduled out. 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It frees the 2 pages on the purge list. vmap_lazy_nr is now lazy_max_pages() + 1. 5. This is still over the threshold, so it tries to purge areas again, but doesn't find anything. 6. Repeat 5. If the system is running with only one CPU (which is typicial for kdump) and preemption is disabled, then this will never make forward progress: there aren't any more pages to purge, so it hangs. If there is more than one CPU or preemption is enabled, then the worker thread will spin forever in the background. (Note that if there were already pages to be purged at the time that set_iounmap_nonlazy() was called, this bug is avoided.) This can be reproduced with anything that reads from /proc/vmcore multiple times. E.g., vmcore-dmesg /proc/vmcore. It turns out that improvements to vmap() over the years have obsoleted the need for this "optimization". I benchmarked `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed entirely: |5.17 |5.18+fix|5.18+removal 4k|40.86s| 40.09s| 26.73s 1M|24.47s| 23.98s| 21.84s The removal was the fastest (by a wide margin with 4k reads). This patch removes set_iounmap_nonlazy(). Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com Fixes: 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context") Signed-off-by: Omar Sandoval <osandov@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-14ARM: config: Update Gemini defconfigLinus Walleij
The Gemini defconfig needs to be updated due to DSA driver Kconfig changes in the v5.18 merge window: CONFIG_NET_DSA_REALTEK_SMI is now behind CONFIG_NET_DSA_REALTEK and the desired DSA switch need to be selected explicitly with CONFIG_NET_DSA_REALTEK_RTL8366RB. Take this opportunity to update some other minor config options: - CONFIG_MARVELL_PHY moved around because of Kconfig changes. - CONFIG_SENSORS_DRIVETEMP should be selected since is regulates the system critical alert temperature on some devices, which is nice if it is handled even if initramfs or root fails to mount. Fixes: 319a70a5fea9 ("net: dsa: realtek-smi: move to subdirectory") Fixes: 765c39a4fafe ("net: dsa: realtek: convert subdrivers into modules") Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Hans Ulli Kroll <ulli.kroll@googlemail.com> Cc: Luiz Angelo Daros de Luca <luizluca@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-04-14arm64: dts: qcom/sdm845-shift-axolotl: Fix boolean properties with valuesRob Herring
Boolean properties in DT are present or not present and don't take a value. A property such as 'foo = <0>;' evaluated to true. IOW, the value doesn't matter. It may have been intended that 0 values are false, but there is no change in behavior with this patch. Signed-off-by: Rob Herring <robh@kernel.org> Cc: Andy Gross <agross@kernel.org> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org> Cc: linux-arm-msm@vger.kernel.org Link: https://lore.kernel.org/r/20220407225254.2178644-1-robh@kernel.org' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-04-14ARM: dts: align SPI NOR node name with dtschemaKrzysztof Kozlowski
The node names should be generic and SPI NOR dtschema expects "flash". Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lore.kernel.org/r/20220407143027.294678-1-krzysztof.kozlowski@linaro.org' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-04-14ARM: dts: Fix more boolean properties with valuesRob Herring
Boolean properties in DT are present or not present and don't take a value. A property such as 'foo = <0>;' evaluated to true. IOW, the value doesn't matter. It may have been intended that 0 values are false, but there is no change in behavior with this patch. Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org> Cc: Nicolas Ferre <nicolas.ferre@microchip.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Claudiu Beznea <claudiu.beznea@microchip.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Pengutronix Kernel Team <kernel@pengutronix.de> Cc: Fabio Estevam <festevam@gmail.com> Cc: NXP Linux Team <linux-imx@nxp.com> Cc: "Benoît Cousson" <bcousson@baylibre.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Andy Gross <agross@kernel.org> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-omap@vger.kernel.org Cc: linux-arm-msm@vger.kernel.org Link: https://lore.kernel.org/r/20220407225107.2175958-1-robh@kernel.org' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-04-14Merge tag 'ux500-defconfig-soc-v5.18' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik into arm/fixes Defconfig updates for kernel v5.18: - Refresh defconfig with new and moved options - Add some new hardware drivers - Activate battery charging * tag 'ux500-defconfig-soc-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik: ARM: config: u8500: Re-enable AB8500 battery charging ARM: config: u8500: Add some common hardware ARM: config: Refresh U8500 defconfig Link: https://lore.kernel.org/r/CACRpkdY+_Go4XNzOh+Rvc24QBnUud2k-S7VQuaH5d-j71_dJog@mail.gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-04-14Merge tag 's390-5.18-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: - Convert current_stack_pointer to a register alias like it is assumed if ARCH_HAS_CURRENT_STACK_POINTER is selected. The existing implementation as a function breaks CONFIG_HARDENED_USERCOPY sanity-checks - Get rid of -Warray-bounds warning within kexec code - Add minimal IBM z16 support by reporting a proper elf platform, and adding compile options - Update defconfigs * tag 's390-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: enable CONFIG_HARDENED_USERCOPY in debug_defconfig s390: current_stack_pointer shouldn't be a function s390: update defconfigs s390/kexec: silence -Warray-bounds warning s390: allow to compile with z16 optimizations s390: add z16 elf platform
2022-04-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "x86: - Miscellaneous bugfixes - A small cleanup for the new workqueue code - Documentation syntax fix RISC-V: - Remove hgatp zeroing in kvm_arch_vcpu_put() - Fix alignment of the guest_hang() in KVM selftest - Fix PTE A and D bits in KVM selftest - Missing #include in vcpu_fp.c ARM: - Some PSCI fixes after introducing PSCIv1.1 and SYSTEM_RESET2 - Fix the MMU write-lock not being taken on THP split - Fix mixed-width VM handling - Fix potential UAF when debugfs registration fails - Various selftest updates for all of the above" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (24 commits) KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU KVM: SVM: Do not activate AVIC for SEV-enabled guest Documentation: KVM: Add SPDX-License-Identifier tag selftests: kvm: add tsc_scaling_sync to .gitignore RISC-V: KVM: include missing hwcap.h into vcpu_fp KVM: selftests: riscv: Fix alignment of the guest_hang() function KVM: selftests: riscv: Set PTE A and D bits in VS-stage page table RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put() selftests: KVM: Free the GIC FD when cleaning up in arch_timer selftests: KVM: Don't leak GIC FD across dirty log test iterations KVM: Don't create VM debugfs files outside of the VM directory KVM: selftests: get-reg-list: Add KVM_REG_ARM_FW_REG(3) KVM: avoid NULL pointer dereference in kvm_dirty_ring_push KVM: arm64: selftests: Introduce vcpu_width_config KVM: arm64: mixed-width check should be skipped for uninitialized vCPUs KVM: arm64: vgic: Remove unnecessary type castings KVM: arm64: Don't split hugepages outside of MMU write lock KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler KVM: arm64: Actually prevent SMC64 SYSTEM_RESET2 from AArch32 KVM: arm64: Generally disallow SMC64 for AArch32 guests ...
2022-04-12stat: fix inconsistency between struct stat and struct compat_statMikulas Patocka
struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit st_dev and st_rdev; struct compat_stat (defined in arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a 16-bit padding. This patch fixes struct compat_stat to match struct stat. [ Historical note: the old x86 'struct stat' did have that 16-bit field that the compat layer had kept around, but it was changes back in 2003 by "struct stat - support larger dev_t": https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d and back in those days, the x86_64 port was still new, and separate from the i386 code, and had already picked up the old version with a 16-bit st_dev field ] Note that we can't change compat_dev_t because it is used by compat_loop_info. Also, if the st_dev and st_rdev values are 32-bit, we don't have to use old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major number 259. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-12s390: enable CONFIG_HARDENED_USERCOPY in debug_defconfigSven Schnelle
Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>