summaryrefslogtreecommitdiff
path: root/arch/s390/kvm/vsie.c
AgeCommit message (Collapse)Author
2025-01-31KVM: s390: move some gmap shadowing functions away from mm/gmap.cClaudio Imbrenda
Move some gmap shadowing functions from mm/gmap.c to kvm/kvm-s390.c and the newly created kvm/gmap-vsie.c This is a step toward removing gmap from mm. Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250123144627.312456-10-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250123144627.312456-10-imbrenda@linux.ibm.com>
2025-01-31KVM: s390: vsie: stop using "struct page" for vsie pageDavid Hildenbrand
Now that we no longer use page->index and the page refcount explicitly, let's avoid messing with "struct page" completely. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Message-ID: <20250107154344.1003072-5-david@redhat.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-31KVM: s390: vsie: stop messing with page refcountDavid Hildenbrand
Let's stop messing with the page refcount, and use a flag that is set / cleared atomically to remember whether a vsie page is currently in use. Note that we could use a page flag, or a lower bit of the scb_gpa. Let's keep it simple for now, we have sufficient space. While at it, stop passing "struct kvm *" to put_vsie_page(), it's unused. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Message-ID: <20250107154344.1003072-4-david@redhat.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-31KVM: s390: vsie: stop using page->indexDavid Hildenbrand
Let's stop using page->index, and instead use a field inside "struct vsie_page" to hold that value. We have plenty of space left in there. This is one part of stopping using "struct page" when working with vsie pages. We place the "page_to_virt(page)" strategically, so the next cleanups requires less churn. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Message-ID: <20250107154344.1003072-3-david@redhat.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-31KVM: s390: vsie: fix some corner-cases when grabbing vsie pagesDavid Hildenbrand
We try to reuse the same vsie page when re-executing the vsie with a given SCB address. The result is that we use the same shadow SCB -- residing in the vsie page -- and can avoid flushing the TLB when re-running the vsie on a CPU. So, when we allocate a fresh vsie page, or when we reuse a vsie page for a different SCB address -- reusing the shadow SCB in different context -- we set ihcpu=0xffff to trigger the flush. However, after we looked up the SCB address in the radix tree, but before we grabbed the vsie page by raising the refcount to 2, someone could reuse the vsie page for a different SCB address, adjusting page->index and the radix tree. In that case, we would be reusing the vsie page with a wrong page->index. Another corner case is that we might set the SCB address for a vsie page, but fail the insertion into the radix tree. Whoever would reuse that page would remove the corresponding radix tree entry -- which might now be a valid entry pointing at another page, resulting in the wrong vsie page getting removed from the radix tree. Let's handle such races better, by validating that the SCB address of a vsie page didn't change after we grabbed it (not reuse for a different SCB; the alternative would be performing another tree lookup), and by setting the SCB address to invalid until the insertion in the tree succeeded (SCB addresses are aligned to 512, so ULONG_MAX is invalid). These scenarios are rare, the effects a bit unclear, and these issues were only found by code inspection. Let's CC stable to be safe. Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization") Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Message-ID: <20250107154344.1003072-2-david@redhat.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-07KVM: s390: vsie: fix virtual/physical address in unpin_scb()Claudio Imbrenda
In commit 77b533411595 ("KVM: s390: VSIE: sort out virtual/physical address in pin_guest_page"), only pin_scb() has been updated. This means that in unpin_scb() a virtual address was still used directly as physical address without conversion. The resulting physical address is obviously wrong and most of the time also invalid. Since commit d0ef8d9fbebe ("KVM: s390: Use kvm_release_page_dirty() to unpin "struct page" memory"), unpin_guest_page() will directly use kvm_release_page_dirty(), instead of kvm_release_pfn_dirty(), which has since been removed. One of the checks that were performed by kvm_release_pfn_dirty() was to verify whether the page was valid at all, and silently return successfully without doing anything if the page was invalid. When kvm_release_pfn_dirty() was still used, the invalid page was thus silently ignored. Now the check is gone and the result is an Oops. This also means that when running with a V!=R kernel, the page was not released, causing a leak. The solution is simply to add the missing virt_to_phys(). Fixes: 77b533411595 ("KVM: s390: VSIE: sort out virtual/physical address in pin_guest_page") Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Link: https://lore.kernel.org/r/20241210083948.23963-1-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20241210083948.23963-1-imbrenda@linux.ibm.com>
2024-11-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "The biggest change here is eliminating the awful idea that KVM had of essentially guessing which pfns are refcounted pages. The reason to do so was that KVM needs to map both non-refcounted pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain refcounted pages. However, the result was security issues in the past, and more recently the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by struct page but is not refcounted. In particular this broke virtio-gpu blob resources (which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device) with the amdgpu driver, because amdgpu allocates non-compound higher order pages and the tail pages could not be mapped into KVM. This requires adjusting all uses of struct page in the per-architecture code, to always work on the pfn whenever possible. The large series that did this, from David Stevens and Sean Christopherson, also cleaned up substantially the set of functions that provided arch code with the pfn for a host virtual addresses. The previous maze of twisty little passages, all different, is replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages) saving almost 200 lines of code. ARM: - Support for stage-1 permission indirection (FEAT_S1PIE) and permission overlays (FEAT_S1POE), including nested virt + the emulated page table walker - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call was introduced in PSCIv1.3 as a mechanism to request hibernation, similar to the S4 state in ACPI - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As part of it, introduce trivial initialization of the host's MPAM context so KVM can use the corresponding traps - PMU support under nested virtualization, honoring the guest hypervisor's trap configuration and event filtering when running a nested guest - Fixes to vgic ITS serialization where stale device/interrupt table entries are not zeroed when the mapping is invalidated by the VM - Avoid emulated MMIO completion if userspace has requested synchronous external abort injection - Various fixes and cleanups affecting pKVM, vCPU initialization, and selftests LoongArch: - Add iocsr and mmio bus simulation in kernel. - Add in-kernel interrupt controller emulation. - Add support for virtualization extensions to the eiointc irqchip. PPC: - Drop lingering and utterly obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect documentation references to non-existing ioctls RISC-V: - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side s390: - New selftests: more ucontrol selftests and CPU model sanity checks - Support for the gen17 CPU model - List registers supported by KVM_GET/SET_ONE_REG in the documentation x86: - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes. Even if the hardware A/D tracking is disabled, it is possible to use the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty, and that removes a lot of special cases. - Elide TLB flushes when aging secondary PTEs, as has been done in x86's primary MMU for over 10 years. - Recover huge pages in-place in the TDP MMU when dirty page logging is toggled off, instead of zapping them and waiting until the page is re-accessed to create a huge mapping. This reduces vCPU jitter. - Batch TLB flushes when dirty page logging is toggled off. This reduces the time it takes to disable dirty logging by ~3x. - Remove the shrinker that was (poorly) attempting to reclaim shadow page tables in low-memory situations. - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Advertise CPUIDs for new instructions in Clearwater Forest - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which in turn can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe; harden the cache accessors to try to prevent similar issues from occuring in the future. The issue that triggered this change was already fixed in 6.12, but was still kinda latent. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups - Switch hugepage recovery thread to use vhost_task. These kthreads can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory); therefore KVM tried to place the thread in the VM's cgroups and charge the CPU time consumed by that work to the VM's container. However the kthreads did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM instances inside could not complete freezing. Fix this by replacing the kthread with a PF_USER_WORKER thread, via the vhost_task abstraction. Another 100+ lines removed, with generally better behavior too like having these threads properly parented in the process tree. - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't really work; there was really nothing to work around anyway: the broken patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the erratum. - Fix 6.12 regression where CONFIG_KVM will be built as a module even if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'. x86 selftests: - x86 selftests can now use AVX. Documentation: - Use rST internal links - Reorganize the introduction to the API document Generic: - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead of RCU, so that running a vCPU on a different task doesn't encounter long due to having to wait for all CPUs become quiescent. In general both reads and writes are rare, but userspace that supports confidential computing is introducing the use of "helper" vCPUs that may jump from one host processor to another. Those will be very happy to trigger a synchronize_rcu(), and the effect on performance is quite the disaster" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits) KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD KVM: x86: add back X86_LOCAL_APIC dependency Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()" KVM: x86: switch hugepage recovery thread to vhost_task KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest Documentation: KVM: fix malformed table irqchip/loongson-eiointc: Add virt extension support LoongArch: KVM: Add irqfd support LoongArch: KVM: Add PCHPIC user mode read and write functions LoongArch: KVM: Add PCHPIC read and write functions LoongArch: KVM: Add PCHPIC device support LoongArch: KVM: Add EIOINTC user mode read and write functions LoongArch: KVM: Add EIOINTC read and write functions LoongArch: KVM: Add EIOINTC device support LoongArch: KVM: Add IPI user mode read and write function LoongArch: KVM: Add IPI read and write function LoongArch: KVM: Add IPI device support LoongArch: KVM: Add iocsr and mmio bus simulation in kernel KVM: arm64: Pass on SVE mapping failures ...
2024-11-12Merge tag 'kvm-s390-next-6.13-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - second part of the ucontrol selftest - cpumodel sanity check selftest - gen17 cpumodel changes
2024-11-11KVM: s390: add msa11 to cpu modelHendrik Brueckner
Message-security-assist 11 introduces pckmo subfunctions to encrypt hmac keys. Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241107152319.77816-3-brueckner@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107152319.77816-3-brueckner@linux.ibm.com>
2024-11-07s390/kvm: Mask extra bits from program interrupt codeClaudio Imbrenda
The program interrupt code has some extra bits that are sometimes set by hardware for various reasons; those bits should be ignored when the program interrupt number is needed for interrupt handling. Fixes: 05066cafa925 ("s390/mm/fault: Handle guest-related program interrupts in KVM") Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241031120316.25462-1-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29s390/kvm: Stop using gmap_{en,dis}able()Claudio Imbrenda
Stop using gmap_enable(), gmap_disable(), gmap_get_enabled(). The correct guest ASCE is passed as a parameter of sie64a(), there is no need to save the current gmap in lowcore. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20241022120601.167009-7-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29s390/mm/fault: Handle guest-related program interrupts in KVMClaudio Imbrenda
Any program interrupt that happens in the host during the execution of a KVM guest will now short circuit the fault handler and return to KVM immediately. Guest fault handling (including pfault) will happen entirely inside KVM. When sie64a() returns zero, current->thread.gmap_int_code will contain the program interrupt number that caused the exit, or zero if the exit was not caused by a host program interrupt. KVM will now take care of handling all guest faults in vcpu_post_run(). Since gmap faults will not be visible by the rest of the kernel, remove GMAP_FAULT, the linux fault handlers for secure execution faults, the exception table entries for the sie instruction, the nop padding after the sie instruction, and all other references to guest faults from the s390 code. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Co-developed-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20241022120601.167009-6-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-25KVM: s390: Use kvm_release_page_dirty() to unpin "struct page" memorySean Christopherson
Use kvm_release_page_dirty() when unpinning guest pages, as the pfn was retrieved via pin_guest_page(), i.e. is guaranteed to be backed by struct page memory. This will allow dropping kvm_release_pfn_dirty() and friends. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-81-seanjc@google.com>
2024-10-25KVM: Drop KVM_ERR_PTR_BAD_PAGE and instead return NULL to indicate an errorSean Christopherson
Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just a leftover bit of weirdness from days of old when KVM stuffed a "bad" page into the guest instead of actually handling missing pages. See commit cea7bb21280e ("KVM: MMU: Make gfn_to_page() always safe"). Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-2-seanjc@google.com>
2024-07-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement - Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogenous hardware - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol - FPSIMD/SVE support for nested, including merged trap configuration and exception routing - New command-line parameter to control the WFx trap behavior under KVM - Introduce kCFI hardening in the EL2 hypervisor - Fixes + cleanups for handling presence/absence of FEAT_TCRX - Miscellaneous fixes + documentation updates LoongArch: - Add paravirt steal time support - Add support for KVM_DIRTY_LOG_INITIALLY_SET - Add perf kvm-stat support for loongarch RISC-V: - Redirect AMO load/store access fault traps to guest - perf kvm stat support - Use guest files for IMSIC virtualization, when available s390: - Assortment of tiny fixes which are not time critical x86: - Fixes for Xen emulation - Add a global struct to consolidate tracking of host values, e.g. EFER - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX - Print the name of the APICv/AVIC inhibits in the relevant tracepoint - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor - Drop MTRR virtualization, and instead always honor guest PAT on CPUs that support self-snoop - Update to the newfangled Intel CPU FMS infrastructure - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored - Misc cleanups x86 - MMU: - Small cleanups, renames and refactoring extracted from the upcoming Intel TDX support - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting, to avoid stalling vCPUs when splitting huge pages - Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state because the callers fully expect a valid SPTE, it's all but dangerous to let more MMU changes happen afterwards x86 - AMD: - Make per-CPU save_area allocations NUMA-aware - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code - Base support for running SEV-SNP guests. API-wise, this includes a new KVM_X86_SNP_VM type, encrypting/measure the initial image into guest memory, and finalizing it before launching it. Internally, there are some gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges This includes basic support for attestation guest requests, enough to say that KVM supports the GHCB 2.0 specification There is no support yet for loading into the firmware those signing keys to be used for attestation requests, and therefore no need yet for the host to provide certificate data for those keys. To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data x86 - Intel: - Remove an unnecessary EPT TLB flush when enabling hardware - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1) - KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed See commit 0dc902267cb3 ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows Generic: - Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE, because the special casing needed by these pages is not due to just unmovability (and in fact they are only unmovable because the CPU cannot access them) - New ioctl to populate the KVM page tables in advance, which is useful to mitigate KVM page faults during guest boot or after live migration. The code will also be used by TDX, but (probably) not through the ioctl - Enable halt poll shrinking by default, as Intel found it to be a clear win - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out() - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout Selftests: - Remove dead code in the memslot modification stress test - Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs - Print the guest pseudo-RNG seed only when it changes, to avoid spamming the log for tests that create lots of VMs - Make the PMU counters test less flaky when counting LLC cache misses by doing CLFLUSH{OPT} in every loop iteration" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits) crypto: ccp: Add the SNP_VLEK_LOAD command KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops KVM: x86: Replace static_call_cond() with static_call() KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event x86/sev: Move sev_guest.h into common SEV header KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event KVM: x86: Suppress MMIO that is triggered during task switch emulation KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault" KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory KVM: Document KVM_PRE_FAULT_MEMORY ioctl mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE perf kvm: Add kvm-stat for loongarch64 LoongArch: KVM: Add PV steal time support in guest side ...
2024-07-10s390/entry: Pass the asce as parameter to sie64a()Claudio Imbrenda
Pass the guest ASCE explicitly as parameter, instead of having sie64a() take it from lowcore. This removes hidden state from lowcore, and makes things look cleaner. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Link: https://lore.kernel.org/r/20240703155900.103783-2-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-04KVM: s390: vsie: retry SIE instruction on host interceptsEric Farman
It's possible that SIE exits for work that the host needs to perform rather than something that is intended for the guest. A Linux guest will ignore this intercept code since there is nothing for it to do, but a more robust solution would rewind the PSW back to the SIE instruction. This will transparently resume the guest once the host completes its work, without the guest needing to process what is effectively a NOP and re-issue SIE itself. Signed-off-by: Eric Farman <farman@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20240301204342.3217540-1-farman@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240301204342.3217540-1-farman@linux.ibm.com>
2024-05-01KVM: s390: vsie: Use virt_to_phys for crypto control blockNina Schoetterl-Glausch
The address of the crypto control block in the (shadow) SIE block is absolute/physical. Convert from virtual to physical when shadowing the guest's control block during VSIE. Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20240429171512.879215-1-nsg@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17KVM: s390: vsie: Use virt_to_phys for facility control blockNina Schoetterl-Glausch
In order for SIE to interpretively execute STFLE, it requires the real or absolute address of a facility-list control block. Before writing the location into the shadow SIE control block, convert it from a virtual address. We currently do not run into this bug because the lower 31 bits are the same for virtual and physical addresses. Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-03-12Merge tag 's390-6.9-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Heiko Carstens: - Various virtual vs physical address usage fixes - Fix error handling in Processor Activity Instrumentation device driver, and export number of counters with a sysfs file - Allow for multiple events when Processor Activity Instrumentation counters are monitored in system wide sampling - Change multiplier and shift values of the Time-of-Day clock source to improve steering precision - Remove a couple of unneeded GFP_DMA flags from allocations - Disable mmap alignment if randomize_va_space is also disabled, to avoid a too small heap - Various changes to allow s390 to be compiled with LLVM=1, since ld.lld and llvm-objcopy will have proper s390 support witch clang 19 - Add __uninitialized macro to Compiler Attributes. This is helpful with s390's FPU code where some users have up to 520 byte stack frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the intention (performance improvement) of such code sections. - Convert switch_to() to an out-of-line function, and use the generic switch_to header file - Replace the usage of s390's debug feature with pr_debug() calls within the zcrypt device driver - Improve hotplug support of the Adjunct Processor device driver - Improve retry handling in the zcrypt device driver - Various changes to the in-kernel FPU code: - Make in-kernel FPU sections preemptible - Convert various larger inline assemblies and assembler files to C, mainly by using singe instruction inline assemblies. This increases readability, but also allows makes it easier to add proper instrumentation hooks - Cleanup of the header files - Provide fast variants of csum_partial() and csum_partial_copy_nocheck() based on vector instructions - Introduce and use a lock to synchronize accesses to zpci device data structures to avoid inconsistent states caused by concurrent accesses - Compile the kernel without -fPIE. This addresses the following problems if the kernel is compiled with -fPIE: - It uses dynamic symbols (.dynsym), for which the linker refuses to allow more than 64k sections. This can break features which use '-ffunction-sections' and '-fdata-sections', including kpatch-build and function granular KASLR - It unnecessarily uses GOT relocations, adding an extra layer of indirection for many memory accesses - Fix shared_cpu_list for CPU private L2 caches, which incorrectly were reported as globally shared * tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits) s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64 s390/cache: prevent rebuild of shared_cpu_list s390/crypto: remove retry loop with sleep from PAES pkey invocation s390/pkey: improve pkey retry behavior s390/zcrypt: improve zcrypt retry behavior s390/zcrypt: introduce retries on in-kernel send CPRB functions s390/ap: introduce mutex to lock the AP bus scan s390/ap: rework ap_scan_bus() to return true on config change s390/ap: clarify AP scan bus related functions and variables s390/ap: rearm APQNs bindings complete completion s390/configs: increase number of LOCKDEP_BITS s390/vfio-ap: handle hardware checkstop state on queue reset operation s390/pai: change sampling event assignment for PMU device driver s390/boot: fix minor comment style damages s390/boot: do not check for zero-termination relocation entry s390/boot: make type of __vmlinux_relocs_64_start|end consistent s390/boot: sanitize kaslr_adjust_relocs() function prototype s390/boot: simplify GOT handling s390: vmlinux.lds.S: fix .got.plt assertion s390/boot: workaround current 'llvm-objdump -t -j ...' behavior ...
2024-02-16s390/kvm: convert to regular kernel fpu userHeiko Carstens
KVM modifies the kernel fpu's regs pointer to its own area to implement its custom version of preemtible kernel fpu context. With general support for preemptible kernel fpu context there is no need for the extra complexity in KVM code anymore. Therefore convert KVM to a regular kernel fpu user. In particular this means that all TIF_FPU checks can be removed, since the fpu register context will never be changed by other kernel fpu users, and also the fpu register context will be restored if a thread is preempted. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16s390/fpu: rename save_fpu_regs() to save_user_fpu_regs(), etcHeiko Carstens
Rename save_fpu_regs(), load_fpu_regs(), and struct thread_struct's fpu member to save_user_fpu_regs(), load_user_fpu_regs(), and ufpu. This way the function and variable names reflect for which context they are supposed to be used. This large and trivial conversion is a prerequisite for making the kernel fpu usage preemptible. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16s390/fpu: convert FPU CIF flag to regular TIF flagHeiko Carstens
The FPU state, as represented by the CIF_FPU flag reflects the FPU state of a task, not the CPU it is running on. Therefore convert the flag to a regular TIF flag. This removes the magic in switch_to() where a save_fpu_regs() call for the currently (previous) running task sets the per-cpu CIF_FPU flag, which is required to restore FPU register contents of the next task, when it returns to user space. Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16s390/fpu: move, rename, and merge header filesHeiko Carstens
Move, rename, and merge the fpu and vx header files. This way fpu header files have a consistent naming scheme (fpu*.h). Also get rid of the fpu subdirectory and move header files to asm directory, so that all fpu and vx header files can be found at the same location. Merge internal.h header file into other header files, since the internal helpers are used at many locations. so those helper functions are really not internal. Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-01-26Merge tag 'kvm-s390-master-6.8-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD pqap instruction missing cc fix vsie shadow creation race fix
2024-01-02Merge tag 'kvm-s390-next-6.8-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - uvdevice fixed additional data return length - stfle (feature indication) vsie fixes and minor cleanup
2023-12-23KVM: s390: vsie: Fix length of facility list shadowedNina Schoetterl-Glausch
The length of the facility list accessed when interpretively executing STFLE is the same as the hosts facility list (in case of format-0) The memory following the facility list doesn't need to be accessible. The current VSIE implementation accesses a fixed length that exceeds the guest/host facility list length and can therefore wrongly inject a validity intercept. Instead, find out the host facility list length by running STFLE and copy only as much as necessary when shadowing. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20231219140854.1042599-3-nsg@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20231219140854.1042599-3-nsg@linux.ibm.com>
2023-12-23KVM: s390: vsie: Fix STFLE interpretive execution identificationNina Schoetterl-Glausch
STFLE can be interpretively executed. This occurs when the facility list designation is unequal to zero. Perform the check before applying the address mask instead of after. Fixes: 66b630d5b7f2 ("KVM: s390: vsie: support STFLE interpretation") Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20231219140854.1042599-2-nsg@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20231219140854.1042599-2-nsg@linux.ibm.com>
2023-12-21KVM: s390: vsie: fix race during shadow creationChristian Borntraeger
Right now it is possible to see gmap->private being zero in kvm_s390_vsie_gmap_notifier resulting in a crash. This is due to the fact that we add gmap->private == kvm after creation: static int acquire_gmap_shadow(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page) { [...] gmap = gmap_shadow(vcpu->arch.gmap, asce, edat); if (IS_ERR(gmap)) return PTR_ERR(gmap); gmap->private = vcpu->kvm; Let children inherit the private field of the parent. Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization") Cc: <stable@vger.kernel.org> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20231220125317.4258-1-borntraeger@linux.ibm.com
2023-11-14KVM: s390: vsie: fix wrong VIR 37 when MSO is usedClaudio Imbrenda
When the host invalidates a guest page, it will also check if the page was used to map the prefix of any guest CPUs, in which case they are stopped and marked as needing a prefix refresh. Upon starting the affected CPUs again, their prefix pages are explicitly faulted in and revalidated if they had been invalidated. A bit in the PGSTEs indicates whether or not a page might contain a prefix. The bit is allowed to overindicate. Pages above 2G are skipped, because they cannot be prefixes, since KVM runs all guests with MSO = 0. The same applies for nested guests (VSIE). When the host invalidates a guest page that maps the prefix of the nested guest, it has to stop the affected nested guest CPUs and mark them as needing a prefix refresh. The same PGSTE bit used for the guest prefix is also used for the nested guest. Pages above 2G are skipped like for normal guests, which is the source of the bug. The nested guest runs is the guest primary address space. The guest could be running the nested guest using MSO != 0. If the MSO + prefix for the nested guest is above 2G, the check for nested prefix will skip it. This will cause the invalidation notifier to not stop the CPUs of the nested guest and not mark them as needing refresh. When the nested guest is run again, its prefix will not be refreshed, since it has not been marked for refresh. This will cause a fatal validity intercept with VIR code 37. Fix this by removing the check for 2G for nested guests. Now all invalidations of pages with the notify bit set will always scan the existing VSIE shadow state descriptors. This allows to catch invalidations of nested guest prefix mappings even when the prefix is above 2G in the guest virtual address space. Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization") Tested-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Message-ID: <20231102153549.53984-1-imbrenda@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2023-10-16KVM: s390: add stat counter for shadow gmap eventsNico Boehr
The shadow gmap tracks memory of nested guests (guest-3). In certain scenarios, the shadow gmap needs to be rebuilt, which is a costly operation since it involves a SIE exit into guest-1 for every entry in the respective shadow level. Add kvm stat counters when new shadow structures are created at various levels. Also add a counter gmap_shadow_create when a completely fresh shadow gmap is created as well as a counter gmap_shadow_reuse when an existing gmap is being reused. Note that when several levels are shadowed at once, counters on all affected levels will be increased. Also note that not all page table levels need to be present and a ASCE can directly point to e.g. a segment table. In this case, a new segment table will always be equivalent to a new shadow gmap and hence will be counted as gmap_shadow_create and not as gmap_shadow_segment. Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20231009093304.2555344-2-nrb@linux.ibm.com Message-Id: <20231009093304.2555344-2-nrb@linux.ibm.com>
2023-07-06Merge tag 's390-6.5-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Alexander Gordeev: - Fix virtual vs physical address confusion in vmem_add_range() and vmem_remove_range() functions - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h> throughout s390 code - Make all PSW related defines also available for assembler files. Remove PSW_DEFAULT_KEY define from uapi for that - When adding an undefined symbol the build still succeeds, but userspace crashes trying to execute VDSO, because the symbol is not resolved. Add undefined symbols check to prevent that - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k memory when executing s390 crypto adapter IOCTL - Add -fPIE flag to prevent decompressor misaligned symbol build error with clang - Use .balign instead of .align everywhere. This is a no-op for s390, but with this there no mix in using .align and .balign anymore - Filter out -mno-pic-data-is-text-relative flag when compiling kernel to prevent VDSO build error - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS mask directly - Do not retry administrative requests to some s390 crypto cards, since the firmware assumes replay attacks - Remove most of the debug code, which is build in when kernel config option CONFIG_ZCRYPT_DEBUG is enabled - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch off the multiple devices support for the s390 zcrypt device driver - With the conversion to generic entry machine checks are accounted to the current context instead of irq time. As result, the STCKF instruction at the beginning of the machine check handler and the lowcore member are no longer required, therefore remove it - Fix various typos found with codespell - Minor cleanups to CPU-measurement Counter and Sampling Facilities code - Revert patch that removes VMEM_MAX_PHYS macro, since it causes a regression * tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits) Revert "s390/mm: get rid of VMEM_MAX_PHYS macro" s390/cpum_sf: remove check on CPU being online s390/cpum_sf: handle casts consistently s390/cpum_sf: remove unnecessary debug statement s390/cpum_sf: remove parameter in call to pr_err s390/cpum_sf: simplify function setup_pmu_cpu s390/cpum_cf: remove unneeded debug statements s390/entry: remove mcck clock s390: fix various typos s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option s390/zcrypt: do not retry administrative requests s390/zcrypt: cleanup some debug code s390/entry: rework entering DAT-on mode on CPU restart s390/mm: fence off VM macros from asm and linker s390: include linux/io.h instead of asm/io.h s390/ptrace: make all psw related defines also available for asm s390/ptrace: remove PSW_DEFAULT_KEY from uapi s390/vdso: filter out mno-pic-data-is-text-relative cflag s390: consistently use .balign instead of .align s390/decompressor: fix misaligned symbol build error ...
2023-07-03s390: fix various typosHeiko Carstens
Fix various typos found with codespell. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-06-16KVM: s390: vsie: fix the length of APCB bitmapPierre Morel
bit_and() uses the count of bits as the woking length. Fix the previous implementation and effectively use the right bitmap size. Fixes: 19fd83a64718 ("KVM: s390: vsie: allow CRYCB FORMAT-1") Fixes: 56019f9aca22 ("KVM: s390: vsie: Allow CRYCB FORMAT-2") Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/kvm/20230511094719.9691-1-pmorel@linux.ibm.com/ Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2023-04-20KVM: s390: vsie: clarifications on setting the APCBPierre Morel
The APCB is part of the CRYCB. The calculation of the APCB origin can be done by adding the APCB offset to the CRYCB origin. Current code makes confusing transformations, converting the CRYCB origin to a pointer to calculate the APCB origin. Let's make things simpler and keep the CRYCB origin to make these calculations. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230214122841.13066-2-pmorel@linux.ibm.com Message-Id: <20230214122841.13066-2-pmorel@linux.ibm.com>
2022-12-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of a unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...
2022-11-24KVM: s390: vsie: Fix the initialization of the epoch extension (epdx) fieldThomas Huth
We recently experienced some weird huge time jumps in nested guests when rebooting them in certain cases. After adding some debug code to the epoch handling in vsie.c (thanks to David Hildenbrand for the idea!), it was obvious that the "epdx" field (the multi-epoch extension) did not get set to 0xff in case the "epoch" field was negative. Seems like the code misses to copy the value from the epdx field from the guest to the shadow control block. By doing so, the weird time jumps are gone in our scenarios. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2140899 Fixes: 8fa1696ea781 ("KVM: s390: Multiple Epoch Facility support") Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Cc: stable@vger.kernel.org # 4.19+ Link: https://lore.kernel.org/r/20221123090833.292938-1-thuth@redhat.com Message-Id: <20221123090833.292938-1-thuth@redhat.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-10-26KVM: s390: VSIE: sort out virtual/physical address in pin_guest_pageNico Boehr
pin_guest_page() used page_to_virt() to calculate the hpa of the pinned page. This currently works, because virtual and physical addresses are the same. Use page_to_phys() instead to resolve the virtual-real address confusion. One caller of pin_guest_page() actually expected the hpa to be a hva, so add the missing phys_to_virt() conversion here. Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20221025082039.117372-2-nrb@linux.ibm.com Message-Id: <20221025082039.117372-2-nrb@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-07-20KVM: s390: guest support for topology functionPierre Morel
We report a topology change to the guest for any CPU hotplug. The reporting to the guest is done using the Multiprocessor Topology-Change-Report (MTCR) bit of the utility entry in the guest's SCA which will be cleared during the interpretation of PTF. On every vCPU creation we set the MCTR bit to let the guest know the next time it uses the PTF with command 2 instruction that the topology changed and that it should use the STSI(15.1.x) instruction to get the topology details. STSI(15.1.x) gives information on the CPU configuration topology. Let's accept the interception of STSI with the function code 15 and let the userland part of the hypervisor handle it when userland supports the CPU Topology facility. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20220714101824.101601-2-pmorel@linux.ibm.com Message-Id: <20220714101824.101601-2-pmorel@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-04-21KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abusedSean Christopherson
Add wrappers to acquire/release KVM's SRCU lock when stashing the index in vcpu->src_idx, along with rudimentary detection of illegal usage, e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the SRCU index is (currently) either 0 or 1, illegal nesting bugs can go unnoticed for quite some time and only cause problems when the nested lock happens to get a different index. Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will likely yell so loudly that it will bring the kernel to its knees. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Fabiano Rosas <farosas@linux.ibm.com> Message-Id: <20220415004343.2203171-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-27KVM: s390: Enable specification exception interpretationJanis Schoetterl-Glausch
When this feature is enabled the hardware is free to interpret specification exceptions generated by the guest, instead of causing program interruption interceptions. This benefits (test) programs that generate a lot of specification exceptions (roughly 4x increase in exceptions/sec). Interceptions will occur as before if ICTL_PINT is set, i.e. if guest debug is enabled. There is no indication if this feature is available or not and the hardware is free to interpret or not. So we can simply set this bit and if the hardware ignores it we fall back to intercept 8 handling. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/linux-s390/20210706114714.3936825-1-scgl@linux.ibm.com/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-03-24KVM: s390: VSIE: fix MVPG handling for prefixing and MSOClaudio Imbrenda
Prefixing needs to be applied to the guest real address to translate it into a guest absolute address. The value of MSO needs to be added to a guest-absolute address in order to obtain the host-virtual. Fixes: bdf7509bbefa ("s390/kvm: VSIE: correctly handle MVPG when in VSIE") Reported-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210322140559.500716-3-imbrenda@linux.ibm.com [borntraeger@de.ibm.com simplify mso] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-03-24KVM: s390: VSIE: correctly handle MVPG when in VSIEClaudio Imbrenda
Correctly handle the MVPG instruction when issued by a VSIE guest. Fixes: a3508fbe9dc6d ("KVM: s390: vsie: initial support for nested virtualization") Cc: stable@vger.kernel.org # f85f1baaa189: KVM: s390: split kvm_s390_logical_to_effective Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20210302174443.514363-4-imbrenda@linux.ibm.com [borntraeger@de.ibm.com: apply fixup from Claudio] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-03-24KVM: s390: extend kvm_s390_shadow_fault to return entry pointerClaudio Imbrenda
Extend kvm_s390_shadow_fault to return the pointer to the valid leaf DAT table entry, or to the invalid entry. Also return some flags in the lower bits of the address: PEI_DAT_PROT: indicates that DAT protection applies because of the protection bit in the segment (or, if EDAT, region) tables. PEI_NOT_PTE: indicates that the address of the DAT table entry returned does not refer to a PTE, but to a segment or region table. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: stable@vger.kernel.org Reviewed-by: Janosch Frank <frankja@de.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20210302174443.514363-3-imbrenda@linux.ibm.com [borntraeger@de.ibm.com: fold in a fix from Claudio] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2021-01-19s390: convert to generic entrySven Schnelle
This patch converts s390 to use the generic entry infrastructure from kernel/entry/*. There are a few special things on s390: - PIF_PER_TRAP is moved to TIF_PER_TRAP as the generic code doesn't know about our PIF flags in exit_to_user_mode_loop(). - The old code had several ways to restart syscalls: a) PIF_SYSCALL_RESTART, which was only set during execve to force a restart after upgrading a process (usually qemu-kvm) to pgste page table extensions. b) PIF_SYSCALL, which is set by do_signal() to indicate that the current syscall should be restarted. This is changed so that do_signal() now also uses PIF_SYSCALL_RESTART. Continuing to use PIF_SYSCALL doesn't work with the generic code, and changing it to PIF_SYSCALL_RESTART makes PIF_SYSCALL and PIF_SYSCALL_RESTART more unique. - On s390 calling sys_sigreturn or sys_rt_sigreturn is implemented by executing a svc instruction on the process stack which causes a fault. While handling that fault the fault code sets PIF_SYSCALL to hand over processing to the syscall code on exit to usermode. The patch introduces PIF_SYSCALL_RET_SET, which is set if ptrace sets a return value for a syscall. The s390x ptrace ABI uses r2 both for the syscall number and return value, so ptrace cannot set the syscall number + return value at the same time. The flag makes handling that a bit easier. do_syscall() will just skip executing the syscall if PIF_SYSCALL_RET_SET is set. CONFIG_DEBUG_ASCE was removd in favour of the generic CONFIG_DEBUG_ENTRY. CR1/7/13 will be checked both on kernel entry and exit to contain the correct asces. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-12-10KVM: s390: Add memcg accounting to KVM allocationsChristian Borntraeger
Almost all kvm allocations in the s390x KVM code can be attributed to the process that triggers the allocation (in other words, no global allocation for other guests). This will help the memcg controller to make the right decisions. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com>
2020-06-23s390/kvm: diagnose 0x318 sync and resetCollin Walling
DIAGNOSE 0x318 (diag318) sets information regarding the environment the VM is running in (Linux, z/VM, etc) and is observed via firmware/service events. This is a privileged s390x instruction that must be intercepted by SIE. Userspace handles the instruction as well as migration. Data is communicated via VCPU register synchronization. The Control Program Name Code (CPNC) is stored in the SIE block. The CPNC along with the Control Program Version Code (CPVC) are stored in the kvm_vcpu_arch struct. This data is reset on load normal and clear resets. Signed-off-by: Collin Walling <walling@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200622154636.5499-3-walling@linux.ibm.com [borntraeger@de.ibm.com: fix sync_reg position] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2020-06-08Merge tag 's390-5.8-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Add support for multi-function devices in pci code. - Enable PF-VF linking for architectures using the pdev->no_vf_scan flag (currently just s390). - Add reipl from NVMe support. - Get rid of critical section cleanup in entry.S. - Refactor PNSO CHSC (perform network subchannel operation) in cio and qeth. - QDIO interrupts and error handling fixes and improvements, more refactoring changes. - Align ioremap() with generic code. - Accept requests without the prefetch bit set in vfio-ccw. - Enable path handling via two new regions in vfio-ccw. - Other small fixes and improvements all over the code. * tag 's390-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (52 commits) vfio-ccw: make vfio_ccw_regops variables declarations static vfio-ccw: Add trace for CRW event vfio-ccw: Wire up the CRW irq and CRW region vfio-ccw: Introduce a new CRW region vfio-ccw: Refactor IRQ handlers vfio-ccw: Introduce a new schib region vfio-ccw: Refactor the unregister of the async regions vfio-ccw: Register a chp_event callback for vfio-ccw vfio-ccw: Introduce new helper functions to free/destroy regions vfio-ccw: document possible errors vfio-ccw: Enable transparent CCW IPL from DASD s390/pci: Log new handle in clp_disable_fh() s390/cio, s390/qeth: cleanup PNSO CHSC s390/qdio: remove q->first_to_kick s390/qdio: fix up qdio_start_irq() kerneldoc s390: remove critical section cleanup from entry.S s390: add machine check SIGP s390/pci: ioremap() align with generic code s390/ap: introduce new ap function ap_get_qdev() Documentation/s390: Update / remove developerWorks web links ...
2020-05-28s390: remove critical section cleanup from entry.SSven Schnelle
The current code is rather complex and caused a lot of subtle and hard to debug bugs in the past. Simplify the code by calling the system_call handler with interrupts disabled, save machine state, and re-enable them later. This requires significant changes to the machine check handling code as well. When the machine check interrupt arrived while being in kernel mode the new code will signal pending machine checks with a SIGP external call. When userspace was interrupted, the handler will switch to the kernel stack and directly execute s390_handle_mcck(). Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-04-20KVM: s390: vsie: Move conditional rescheduleDavid Hildenbrand
Let's move it to the outer loop, in case we ever run again into long loops, trying to map the prefix. While at it, convert it to cond_resched(). Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200403153050.20569-5-david@redhat.com Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>