summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-11-10KVM: selftests: memslot_perf_test: Use data->nslots in prepare_vm()Gavin Shan
In prepare_vm(), 'data->nslots' is assigned with 'max_mem_slots - 1' at the beginning, meaning they are interchangeable. Use 'data->nslots' isntead of 'max_mem_slots - 1'. With this, it becomes easier to move the logic of probing number of slots into upper layer in subsequent patches. No functional change intended. Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221020071209.559062-2-gshan@redhat.com
2022-11-10KVM: arm64: Handle stage-2 faults in parallelOliver Upton
The stage-2 map walker has been made parallel-aware, and as such can be called while only holding the read side of the MMU lock. Rip out the conditional locking in user_mem_abort() and instead grab the read lock. Continue to take the write lock from other callsites to kvm_pgtable_stage2_map(). Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107220033.1895655-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Make table->block changes parallel-awareOliver Upton
stage2_map_walker_try_leaf() and friends now handle stage-2 PTEs generically, and perform the correct flush when a table PTE is removed. Additionally, they've been made parallel-aware, using an atomic break to take ownership of the PTE. Stop clearing the PTE in the pre-order callback and instead let stage2_map_walker_try_leaf() deal with it. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107220006.1895572-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Make leaf->leaf PTE changes parallel-awareOliver Upton
Convert stage2_map_walker_try_leaf() to use the new break-before-make helpers, thereby making the handler parallel-aware. As before, avoid the break-before-make if recreating the existing mapping. Additionally, retry execution if another vCPU thread is modifying the same PTE. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215934.1895478-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Make block->table PTE changes parallel-awareOliver Upton
In order to service stage-2 faults in parallel, stage-2 table walkers must take exclusive ownership of the PTE being worked on. An additional requirement of the architecture is that software must perform a 'break-before-make' operation when changing the block size used for mapping memory. Roll these two concepts together into helpers for performing a 'break-before-make' sequence. Use a special PTE value to indicate a PTE has been locked by a software walker. Additionally, use an atomic compare-exchange to 'break' the PTE when the stage-2 page tables are possibly shared with another software walker. Elide the DSB + TLBI if the evicted PTE was invalid (and thus not subject to break-before-make). All of the atomics do nothing for now, as the stage-2 walker isn't fully ready to perform parallel walks. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215855.1895367-1-oliver.upton@linux.dev
2022-11-10KVM: arm64: Split init and set for table PTEOliver Upton
Create a helper to initialize a table and directly call smp_store_release() to install it (for now). Prepare for a subsequent change that generalizes PTE writes with a helper. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-11-oliver.upton@linux.dev
2022-11-10KVM: arm64: Atomically update stage 2 leaf attributes in parallel walksOliver Upton
The stage2 attr walker is already used for parallel walks. Since commit f783ef1c0e82 ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging"), KVM acquires the read lock when write-unprotecting a PTE. However, the walker only uses a simple store to update the PTE. This is safe as the only possible race is with hardware updates to the access flag, which is benign. However, a subsequent change to KVM will allow more changes to the stage 2 page tables to be done in parallel. Prepare the stage 2 attribute walker by performing atomic updates to the PTE when walking in parallel. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-10-oliver.upton@linux.dev
2022-11-10KVM: arm64: Protect stage-2 traversal with RCUOliver Upton
Use RCU to safely walk the stage-2 page tables in parallel. Acquire and release the RCU read lock when traversing the page tables. Defer the freeing of table memory to an RCU callback. Indirect the calls into RCU and provide stubs for hypervisor code, as RCU is not available in such a context. The RCU protection doesn't amount to much at the moment, as readers are already protected by the read-write lock (all walkers that free table memory take the write lock). Nonetheless, a subsequent change will futher relax the locking requirements around the stage-2 MMU, thereby depending on RCU. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-9-oliver.upton@linux.dev
2022-11-10KVM: arm64: Tear down unlinked stage-2 subtree after break-before-makeOliver Upton
The break-before-make sequence is a bit annoying as it opens a window wherein memory is unmapped from the guest. KVM should replace the PTE as quickly as possible and avoid unnecessary work in between. Presently, the stage-2 map walker tears down a removed table before installing a block mapping when coalescing a table into a block. As the removed table is no longer visible to hardware walkers after the DSB+TLBI, it is possible to move the remaining cleanup to happen after installing the new PTE. Reshuffle the stage-2 map walker to install the new block entry in the pre-order callback. Unwire all of the teardown logic and replace it with a call to kvm_pgtable_stage2_free_removed() after fixing the PTE. The post-order visitor is now completely unnecessary, so drop it. Finally, touch up the comments to better represent the now simplified map walker. Note that the call to tear down the unlinked stage-2 is indirected as a subsequent change will use an RCU callback to trigger tear down. RCU is not available to pKVM, so there is a need to use different implementations on pKVM and non-pKVM VMs. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-8-oliver.upton@linux.dev
2022-11-10KVM: arm64: Use an opaque type for ptepsOliver Upton
Use an opaque type for pteps and require visitors explicitly dereference the pointer before using. Protecting page table memory with RCU requires that KVM dereferences RCU-annotated pointers before using. However, RCU is not available for use in the nVHE hypervisor and the opaque type can be conditionally annotated with RCU for the stage-2 MMU. Call the type a 'pteref' to avoid a naming collision with raw pteps. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-7-oliver.upton@linux.dev
2022-11-10KVM: arm64: Add a helper to tear down unlinked stage-2 subtreesOliver Upton
A subsequent change to KVM will move the tear down of an unlinked stage-2 subtree out of the critical path of the break-before-make sequence. Introduce a new helper for tearing down unlinked stage-2 subtrees. Leverage the existing stage-2 free walkers to do so, with a deep call into __kvm_pgtable_walk() as the subtree is no longer reachable from the root. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-6-oliver.upton@linux.dev
2022-11-10KVM: arm64: Don't pass kvm_pgtable through kvm_pgtable_walk_dataOliver Upton
In order to tear down page tables from outside the context of kvm_pgtable (such as an RCU callback), stop passing a pointer through kvm_pgtable_walk_data. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-5-oliver.upton@linux.dev
2022-11-10KVM: arm64: Pass mm_ops through the visitor contextOliver Upton
As a prerequisite for getting visitors off of struct kvm_pgtable, pass mm_ops through the visitor context. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-4-oliver.upton@linux.dev
2022-11-10KVM: arm64: Stash observed pte value in visitor contextOliver Upton
Rather than reading the ptep all over the shop, read the ptep once from __kvm_pgtable_visit() and stick it in the visitor context. Reread the ptep after visiting a leaf in case the callback installed a new table underneath. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-3-oliver.upton@linux.dev
2022-11-10KVM: arm64: Combine visitor arguments into a context structureOliver Upton
Passing new arguments by value to the visitor callbacks is extremely inflexible for stuffing new parameters used by only some of the visitors. Use a context structure instead and pass the pointer through to the visitor callback. While at it, redefine the 'flags' parameter to the visitor to contain the bit indicating the phase of the walk. Pass the entire set of flags through the context structure such that the walker can communicate additional state to the visitor callback. No functional change intended. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221107215644.1895162-2-oliver.upton@linux.dev
2022-11-10KVM: selftests: Automate choosing dirty ring size in dirty_log_testGavin Shan
In the dirty ring case, we rely on vcpu exit due to full dirty ring state. On ARM64 system, there are 4096 host pages when the host page size is 64KB. In this case, the vcpu never exits due to the full dirty ring state. The similar case is 4KB page size on host and 64KB page size on guest. The vcpu corrupts same set of host pages, but the dirty page information isn't collected in the main thread. This leads to infinite loop as the following log shows. # ./dirty_log_test -M dirty-ring -c 65536 -m 5 Setting log mode to: 'dirty-ring' Test iterations: 32, interval: 10 (ms) Testing guest mode: PA-bits:40, VA-bits:48, 4K pages guest physical test memory offset: 0xffbffe0000 vcpu stops because vcpu is kicked out... Notifying vcpu to continue vcpu continues now. Iteration 1 collected 576 pages <No more output afterwards> Fix the issue by automatically choosing the best dirty ring size, to ensure vcpu exit due to full dirty ring state. The option '-c' becomes a hint to the dirty ring count, instead of the value of it. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-8-gshan@redhat.com
2022-11-10KVM: selftests: Clear dirty ring states between two modes in dirty_log_testGavin Shan
There are two states, which need to be cleared before next mode is executed. Otherwise, we will hit failure as the following messages indicate. - The variable 'dirty_ring_vcpu_ring_full' shared by main and vcpu thread. It's indicating if the vcpu exit due to full ring buffer. The value can be carried from previous mode (VM_MODE_P40V48_4K) to current one (VM_MODE_P40V48_64K) when VM_MODE_P40V48_16K isn't supported. - The current ring buffer index needs to be reset before next mode (VM_MODE_P40V48_64K) is executed. Otherwise, the stale value is carried from previous mode (VM_MODE_P40V48_4K). # ./dirty_log_test -M dirty-ring Setting log mode to: 'dirty-ring' Test iterations: 32, interval: 10 (ms) Testing guest mode: PA-bits:40, VA-bits:48, 4K pages guest physical test memory offset: 0xffbfffc000 : Dirtied 995328 pages Total bits checked: dirty (1012434), clear (7114123), track_next (966700) Testing guest mode: PA-bits:40, VA-bits:48, 64K pages guest physical test memory offset: 0xffbffc0000 vcpu stops because vcpu is kicked out... vcpu continues now. Notifying vcpu to continue Iteration 1 collected 0 pages vcpu stops because dirty ring is full... vcpu continues now. vcpu stops because dirty ring is full... vcpu continues now. vcpu stops because dirty ring is full... ==== Test Assertion Failure ==== dirty_log_test.c:369: cleared == count pid=10541 tid=10541 errno=22 - Invalid argument 1 0x0000000000403087: dirty_ring_collect_dirty_pages at dirty_log_test.c:369 2 0x0000000000402a0b: log_mode_collect_dirty_pages at dirty_log_test.c:492 3 (inlined by) run_test at dirty_log_test.c:795 4 (inlined by) run_test at dirty_log_test.c:705 5 0x0000000000403a37: for_each_guest_mode at guest_modes.c:100 6 0x0000000000401ccf: main at dirty_log_test.c:938 7 0x0000ffff9ecd279b: ?? ??:0 8 0x0000ffff9ecd286b: ?? ??:0 9 0x0000000000401def: _start at ??:? Reset dirty pages (0) mismatch with collected (35566) Fix the issues by clearing 'dirty_ring_vcpu_ring_full' and the ring buffer index before next new mode is to be executed. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-7-gshan@redhat.com
2022-11-10KVM: selftests: Use host page size to map ring buffer in dirty_log_testGavin Shan
In vcpu_map_dirty_ring(), the guest's page size is used to figure out the offset in the virtual area. It works fine when we have same page sizes on host and guest. However, it fails when the page sizes on host and guest are different on arm64, like below error messages indicates. # ./dirty_log_test -M dirty-ring -m 7 Setting log mode to: 'dirty-ring' Test iterations: 32, interval: 10 (ms) Testing guest mode: PA-bits:40, VA-bits:48, 64K pages guest physical test memory offset: 0xffbffc0000 vcpu stops because vcpu is kicked out... Notifying vcpu to continue vcpu continues now. ==== Test Assertion Failure ==== lib/kvm_util.c:1477: addr == MAP_FAILED pid=9000 tid=9000 errno=0 - Success 1 0x0000000000405f5b: vcpu_map_dirty_ring at kvm_util.c:1477 2 0x0000000000402ebb: dirty_ring_collect_dirty_pages at dirty_log_test.c:349 3 0x00000000004029b3: log_mode_collect_dirty_pages at dirty_log_test.c:478 4 (inlined by) run_test at dirty_log_test.c:778 5 (inlined by) run_test at dirty_log_test.c:691 6 0x0000000000403a57: for_each_guest_mode at guest_modes.c:105 7 0x0000000000401ccf: main at dirty_log_test.c:921 8 0x0000ffffb06ec79b: ?? ??:0 9 0x0000ffffb06ec86b: ?? ??:0 10 0x0000000000401def: _start at ??:? Dirty ring mapped private Fix the issue by using host's page size to map the ring buffer. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-6-gshan@redhat.com
2022-11-10KVM: arm64: Enable ring-based dirty memory trackingGavin Shan
Enable ring-based dirty memory tracking on ARM64: - Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL. - Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP. - Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page offset. - Add ARM64 specific kvm_arch_allow_write_without_running_vcpu() to keep the site of saving vgic/its tables out of the no-running-vcpu radar. Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-5-gshan@redhat.com
2022-11-10KVM: Support dirty ring in conjunction with bitmapGavin Shan
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is enabled. It's conflicting with that ring-based dirty page tracking always requires a running VCPU context. Introduce a new flavor of dirty ring that requires the use of both VCPU dirty rings and a dirty bitmap. The expectation is that for non-VCPU sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to the dirty bitmap. Userspace should scan the dirty bitmap before migrating the VM to the target. Use an additional capability to advertise this behavior. The newly added capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Suggested-by: Marc Zyngier <maz@kernel.org> Suggested-by: Peter Xu <peterx@redhat.com> Co-developed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com
2022-11-10KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.hGavin Shan
Not all architectures like ARM64 need to override the function. Move its declaration to kvm_dirty_ring.h to avoid the following compiling warning on ARM64 when the feature is enabled. arch/arm64/kvm/../../../virt/kvm/dirty_ring.c:14:12: \ warning: no previous prototype for 'kvm_cpu_dirty_log_size' \ [-Wmissing-prototypes] \ int __weak kvm_cpu_dirty_log_size(void) Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-3-gshan@redhat.com
2022-11-10KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULLGavin Shan
The VCPU isn't expected to be runnable when the dirty ring becomes soft full, until the dirty pages are harvested and the dirty ring is reset from userspace. So there is a check in each guest's entrace to see if the dirty ring is soft full or not. The VCPU is stopped from running if its dirty ring has been soft full. The similar check will be needed when the feature is going to be supported on ARM64. As Marc Zyngier suggested, a new event will avoid pointless overhead to check the size of the dirty ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance. Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring becomes soft full in kvm_dirty_ring_push(). The event is only cleared in the check, done in the newly added helper kvm_dirty_ring_check_request(). Since the VCPU is not runnable when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from running until the dirty pages are harvested and the dirty ring is reset by userspace. kvm_dirty_ring_soft_full() becomes a private function with the newly added helper kvm_dirty_ring_check_request(). The alignment for the various event definitions in kvm_host.h is changed to tab character by the way. In order to avoid using 'container_of()', the argument @ring is replaced by @vcpu in kvm_dirty_ring_push(). Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
2022-11-10Merge tag 'kvmarm-fixes-6.1-3' into kvm-arm64/dirty-ringMarc Zyngier
KVM/arm64 fixes for 6.1, take #3 - Fix the pKVM stage-1 walker erronously using the stage-2 accessor - Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration - Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them - Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE - Document the boot requirements for FGT when entering the kernel at EL1 Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-11-09KVM: replace direct irq.h inclusionPaolo Bonzini
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source directory (i.e. not from arch/*/include) for the sole purpose of retrieving irqchip_in_kernel. Making the function inline in a header that is already included, such as asm/kvm_host.h, is not possible because it needs to look at struct kvm which is defined after asm/kvm_host.h is included. So add a kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is only performance critical on arm64 and x86, and the non-inline function is enough on all other architectures. irq.h can then be deleted from all architectures except x86. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counterLike Xu
Defer reprogramming counters and handling overflow via KVM_REQ_PMU when incrementing counters. KVM skips emulated WRMSR in the VM-Exit fastpath, the fastpath runs with IRQs disabled, skipping instructions can increment and reprogram counters, reprogramming counters can sleep, and sleeping is disallowed while IRQs are disabled. [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM [*] preempt_count: 1, expected: 0 [*] RCU nest depth: 0, expected: 0 [*] INFO: lockdep is turned off. [*] irq event stamp: 0 [*] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0 [*] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0 [*] softirqs last disabled at (0): [<0000000000000000>] 0x0 [*] Preemption disabled at: [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm] [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2 [*] Call Trace: [*] <TASK> [*] dump_stack_lvl+0x6c/0x9b [*] __might_resched.cold+0x22e/0x297 [*] __mutex_lock+0xc0/0x23b0 [*] perf_event_ctx_lock_nested+0x18f/0x340 [*] perf_event_pause+0x1a/0x110 [*] reprogram_counter+0x2af/0x1490 [kvm] [*] kvm_pmu_trigger_event+0x429/0x950 [kvm] [*] kvm_skip_emulated_instruction+0x48/0x90 [kvm] [*] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm] [*] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel] [*] vcpu_enter_guest+0x1d22/0x3dc0 [kvm] Add a field to kvm_pmc to track the previous counter value in order to defer overflow detection to kvm_pmu_handle_event() (the counter must be paused before handling overflow, and that may increment the counter). Opportunistically shrink sizeof(struct kvm_pmc) a bit. Suggested-by: Wanpeng Li <wanpengli@tencent.com> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-6-likexu@tencent.com [sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()Like Xu
Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming a counter multiple times during a single VM-Exit. Deferring programming will also allow KVM to fix a bug where immediately reprogramming a counter can result in sleeping (taking a mutex) while interrupts are disabled in the VM-Exit fastpath. Introduce kvm_pmu_request_counter_reprogam() to make it obvious that KVM is _requesting_ a reprogram and not actually doing the reprogram. Opportunistically refine related comments to avoid misunderstandings. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-5-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowedSean Christopherson
When reprogramming a counter, clear the counter's "reprogram pending" bit if the counter is disabled (by the guest) or is disallowed (by the userspace filter). In both cases, there's no need to re-attempt programming on the next coincident KVM_REQ_PMU as enabling the counter by either method will trigger reprogramming. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Force reprogramming of all counters on PMU filter changeSean Christopherson
Force vCPUs to reprogram all counters on a PMU filter change to provide a sane ABI for userspace. Use the existing KVM_REQ_PMU to do the programming, and take advantage of the fact that the reprogram_pmi bitmap fits in a u64 to set all bits in a single atomic update. Note, setting the bitmap and making the request needs to be done _after_ the SRCU synchronization to ensure that vCPUs will reprogram using the new filter. KVM's current "lazy" approach is confusing and non-deterministic. It's confusing because, from a developer perspective, the code is buggy as it makes zero sense to let userspace modify the filter but then not actually enforce the new filter. The lazy approach is non-deterministic because KVM enforces the filter whenever a counter is reprogrammed, not just on guest WRMSRs, i.e. a guest might gain/lose access to an event at random times depending on what is going on in the host. Note, the resulting behavior is still non-determinstic while the filter is in flux. If userspace wants to guarantee deterministic behavior, all vCPUs should be paused during the filter update. Jim Mattson <jmattson@google.com> Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter") Cc: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220923001355.3741194-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zappedSean Christopherson
Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed NX huge page regardless of the MMU type. Recovery runs while holding mmu_lock for write and so it should be impossible to get false positives on the WARN. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()Mingwei Zhang
Explicitly check if a NX huge page is disallowed when determining if a page fault needs to be forced to use a smaller sized page. KVM currently assumes that the NX huge page mitigation is the only scenario where KVM will force a shadow page instead of a huge page, and so unnecessarily keeps an existing shadow page instead of replacing it with a huge page. Any scenario that causes KVM to zap leaf SPTEs may result in having a SP that can be made huge without violating the NX huge page mitigation. E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging"), KVM would keep shadow pages after disabling dirty logging due to a live migration being canceled, resulting in degraded performance due to running with 4kb pages instead of huge pages. Although the dirty logging case is "fixed", that fix is coincidental, i.e. is an implementation detail, and there are other scenarios where KVM will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page migration (mmu_notifier invalidation) to create a huge page would yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume a huge page is disallowed. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: use spte_to_child_sp(), massage changelog, fold into if-statement] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Add helper to convert SPTE value to its shadow pageSean Christopherson
Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pagesSean Christopherson
Track the number of TDP MMU "shadow" pages instead of tracking the pages themselves. With the NX huge page list manipulation moved out of the common linking flow, elminating the list-based tracking means the happy path of adding a shadow page doesn't need to acquire a spinlock and can instead inc/dec an atomic. Keep the tracking as the WARN during TDP MMU teardown on leaked shadow pages is very, very useful for detecting KVM bugs. Tracking the number of pages will also make it trivial to expose the counter to userspace as a stat in the future, which may or may not be desirable. Note, the TDP MMU needs to use a separate counter (and stat if that ever comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the TDP MMU consumes so few pages relative to shadow paging), and including TDP MMU pages in that counter would break both the shrinker and shadow MMUs, e.g. if a VM is using nested TDP. Cc: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTESean Christopherson
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP visible to other readers, i.e. before setting its SPTE. This will allow KVM to query the flag when determining if a shadow page can be replaced by a NX huge page without violating the rules of the mitigation. Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible for another CPU to see a shadow page without an up-to-date nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated dance. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUsSean Christopherson
Account and track NX huge pages for nonpaging MMUs so that a future enhancement to precisely check if a shadow page can't be replaced by a NX huge page doesn't get false positives. Without correct tracking, KVM can get stuck in a loop if an instruction is fetching and writing data on the same huge page, e.g. KVM installs a small executable page on the fetch fault, replaces it with an NX huge page on the write fault, and faults again on the fetch. Alternatively, and perhaps ideally, KVM would simply not enforce the workaround for nonpaging MMUs. The guest has no page tables to abuse and KVM is guaranteed to switch to a different MMU on CR0.PG being toggled so there's no security or performance concerns. However, getting make_spte() to play nice now and in the future is unnecessarily complex. In the current code base, make_spte() can enforce the mitigation if TDP is enabled or the MMU is indirect, but make_spte() may not always have a vCPU/MMU to work with, e.g. if KVM were to support in-line huge page promotion when disabling dirty logging. Without a vCPU/MMU, KVM could either pass in the correct information and/or derive it from the shadow page, but the former is ugly and the latter subtly non-trivial due to the possibility of direct shadow pages in indirect MMUs. Given that using shadow paging with an unpaged guest is far from top priority _and_ has been subjected to the workaround since its inception, keep it simple and just fix the accounting glitch. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Rename NX huge pages fields/functions for consistencySean Christopherson
Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Tag disallowed NX huge pages even if they're not trackedSean Christopherson
Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09selftests: kvm/x86: Test the flags in MSR filtering and MSR exitingAaron Lewis
When using the flags in KVM_X86_SET_MSR_FILTER and KVM_CAP_X86_USER_SPACE_MSR it is expected that an attempt to write to any of the unused bits will fail. Add testing to walk over every bit in each of the flag fields in MSR filtering and MSR exiting to verify that unused bits return and error and used bits, i.e. valid bits, succeed. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-6-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_rangeAaron Lewis
Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the struct kvm_msr_filter_range. This simplifies checks that validate these flags, and makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-5-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: Add a VALID_MASK for the flag in kvm_msr_filterAaron Lewis
Add the mask KVM_MSR_FILTER_VALID_MASK for the flag in the struct kvm_msr_filter. This makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-4-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: Add a VALID_MASK for the MSR exit reason flagsAaron Lewis
Add the mask KVM_MSR_EXIT_REASON_VALID_MASK for the MSR exit reason flags. This simplifies checks that validate these flags, and makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220921151525.904162-3-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: Disallow the use of KVM_MSR_FILTER_DEFAULT_ALLOW in the kernelAaron Lewis
Protect the kernel from using the flag KVM_MSR_FILTER_DEFAULT_ALLOW. Its value is 0, and using it incorrectly could have unintended consequences. E.g. prevent someone in the kernel from writing something like this. if (filter.flags & KVM_MSR_FILTER_DEFAULT_ALLOW) <allow the MSR> and getting confused when it doesn't work. It would be more ideal to remove this flag altogether, but userspace may already be using it, so protecting the kernel is all that can reasonably be done at this point. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921151525.904162-2-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09kvm: x86: Allow to respond to generic signals during slow PFPeter Xu
Enable x86 slow page faults to be able to respond to non-fatal signals, returning -EINTR properly when it happens. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195947.557281-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09kvm: Add interruptible flag to __gfn_to_pfn_memslot()Peter Xu
Add a new "interruptible" flag showing that the caller is willing to be interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it up with a FOLL_INTERRUPTIBLE flag that we've just introduced. This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the SIGIPI) even during e.g. handling an userfaultfd page fault. No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09kvm: Add KVM_PFN_ERR_SIGPENDINGPeter Xu
Add a new pfn error to show that we've got a pending signal to handle during hva_to_pfn_slow() procedure (of -EINTR retval). Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-3-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09mm/gup: Add FOLL_INTERRUPTIBLEPeter Xu
We have had FAULT_FLAG_INTERRUPTIBLE but it was never applied to GUPs. One issue with it is that not all GUP paths are able to handle signal delivers besides SIGKILL. That's not ideal for the GUP users who are actually able to handle these cases, like KVM. KVM uses GUP extensively on faulting guest pages, during which we've got existing infrastructures to retry a page fault at a later time. Allowing the GUP to be interrupted by generic signals can make KVM related threads to be more responsive. For examples: (1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI, e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be generated to kick the vcpus out of kernel context immediately, (2) SIGINT: which can be used with interactive hypervisor users to stop a virtual machine with Ctrl-C without any delays/hangs, (3) SIGTRAP: which grants GDB capability even during page faults that are stuck for a long time. Normally hypervisor will be able to receive these signals properly, but not if we're stuck in a GUP for a long time for whatever reason. It happens easily with a stucked postcopy migration when e.g. a network temp failure happens, then some vcpu threads can hang death waiting for the pages. With the new FOLL_INTERRUPTIBLE, we can allow GUP users like KVM to selectively enable the ability to trap these signals. Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20221011195809.557016-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: smm: preserve interrupt shadow in SMRAMMaxim Levitsky
When #SMI is asserted, the CPU can be in interrupt shadow due to sti or mov ss. It is not mandatory in Intel/AMD prm to have the #SMI blocked during the shadow, and on top of that, since neither SVM nor VMX has true support for SMI window, waiting for one instruction would mean single stepping the guest. Instead, allow #SMI in this case, but both reset the interrupt window and stash its value in SMRAM to restore it on exit from SMM. This fixes rare failures seen mostly on windows guests on VMX, when #SMI falls on the sti instruction which mainfest in VM entry failure due to EFLAGS.IF not being set, but STI interrupt window still being set in the VMCS. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-24-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: SVM: don't save SVM state to SMRAM when VM is not long mode capableMaxim Levitsky
When the guest CPUID doesn't have support for long mode, 32 bit SMRAM layout is used and it has no support for preserving EFER and/or SVM state. Note that this isn't relevant to running 32 bit guests on VM which is long mode capable - such VM can still run 32 bit guests in compatibility mode. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-23-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: SVM: use smram structsMaxim Levitsky
Use SMM structs in the SVM code as well, which removes the last user of put_smstate/GET_SMSTATE so remove these macros as well. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-22-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: svm: drop explicit return value of kvm_vcpu_mapMaxim Levitsky
if kvm_vcpu_map returns non zero value, error path should be triggered regardless of the exact returned error value. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-21-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: smm: use smram struct for 64 bit smram load/restoreMaxim Levitsky
Use kvm_smram_state_64 struct to save/restore the 64 bit SMM state (used when X86_FEATURE_LM is present in the guest CPUID, regardless of 32-bitness of the guest). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-20-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>