Age | Commit message (Collapse) | Author |
|
Move some gmap shadowing functions from mm/gmap.c to kvm/kvm-s390.c and
the newly created kvm/gmap-vsie.c
This is a step toward removing gmap from mm.
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-10-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-10-imbrenda@linux.ibm.com>
|
|
Now that we no longer use page->index and the page refcount explicitly,
let's avoid messing with "struct page" completely.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-5-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Let's stop messing with the page refcount, and use a flag that is set /
cleared atomically to remember whether a vsie page is currently in use.
Note that we could use a page flag, or a lower bit of the scb_gpa. Let's
keep it simple for now, we have sufficient space.
While at it, stop passing "struct kvm *" to put_vsie_page(), it's
unused.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-4-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Let's stop using page->index, and instead use a field inside "struct
vsie_page" to hold that value. We have plenty of space left in there.
This is one part of stopping using "struct page" when working with vsie
pages. We place the "page_to_virt(page)" strategically, so the next
cleanups requires less churn.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-3-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
We try to reuse the same vsie page when re-executing the vsie with a
given SCB address. The result is that we use the same shadow SCB --
residing in the vsie page -- and can avoid flushing the TLB when
re-running the vsie on a CPU.
So, when we allocate a fresh vsie page, or when we reuse a vsie page for
a different SCB address -- reusing the shadow SCB in different context --
we set ihcpu=0xffff to trigger the flush.
However, after we looked up the SCB address in the radix tree, but before
we grabbed the vsie page by raising the refcount to 2, someone could reuse
the vsie page for a different SCB address, adjusting page->index and the
radix tree. In that case, we would be reusing the vsie page with a
wrong page->index.
Another corner case is that we might set the SCB address for a vsie
page, but fail the insertion into the radix tree. Whoever would reuse
that page would remove the corresponding radix tree entry -- which might
now be a valid entry pointing at another page, resulting in the wrong
vsie page getting removed from the radix tree.
Let's handle such races better, by validating that the SCB address of a
vsie page didn't change after we grabbed it (not reuse for a different
SCB; the alternative would be performing another tree lookup), and by
setting the SCB address to invalid until the insertion in the tree
succeeded (SCB addresses are aligned to 512, so ULONG_MAX is invalid).
These scenarios are rare, the effects a bit unclear, and these issues were
only found by code inspection. Let's CC stable to be safe.
Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Message-ID: <20250107154344.1003072-2-david@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
In commit 77b533411595 ("KVM: s390: VSIE: sort out virtual/physical
address in pin_guest_page"), only pin_scb() has been updated. This
means that in unpin_scb() a virtual address was still used directly as
physical address without conversion. The resulting physical address is
obviously wrong and most of the time also invalid.
Since commit d0ef8d9fbebe ("KVM: s390: Use kvm_release_page_dirty() to
unpin "struct page" memory"), unpin_guest_page() will directly use
kvm_release_page_dirty(), instead of kvm_release_pfn_dirty(), which has
since been removed.
One of the checks that were performed by kvm_release_pfn_dirty() was to
verify whether the page was valid at all, and silently return
successfully without doing anything if the page was invalid.
When kvm_release_pfn_dirty() was still used, the invalid page was thus
silently ignored. Now the check is gone and the result is an Oops.
This also means that when running with a V!=R kernel, the page was not
released, causing a leak.
The solution is simply to add the missing virt_to_phys().
Fixes: 77b533411595 ("KVM: s390: VSIE: sort out virtual/physical address in pin_guest_page")
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20241210083948.23963-1-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20241210083948.23963-1-imbrenda@linux.ibm.com>
|
|
Pull kvm updates from Paolo Bonzini:
"The biggest change here is eliminating the awful idea that KVM had of
essentially guessing which pfns are refcounted pages.
The reason to do so was that KVM needs to map both non-refcounted
pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
VMAs that contain refcounted pages.
However, the result was security issues in the past, and more recently
the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
struct page but is not refcounted. In particular this broke virtio-gpu
blob resources (which directly map host graphics buffers into the
guest as "vram" for the virtio-gpu device) with the amdgpu driver,
because amdgpu allocates non-compound higher order pages and the tail
pages could not be mapped into KVM.
This requires adjusting all uses of struct page in the
per-architecture code, to always work on the pfn whenever possible.
The large series that did this, from David Stevens and Sean
Christopherson, also cleaned up substantially the set of functions
that provided arch code with the pfn for a host virtual addresses.
The previous maze of twisty little passages, all different, is
replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
non-__ versions of these two, and kvm_prefetch_pages) saving almost
200 lines of code.
ARM:
- Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
call was introduced in PSCIv1.3 as a mechanism to request
hibernation, similar to the S4 state in ACPI
- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
- PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
- Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
- Avoid emulated MMIO completion if userspace has requested
synchronous external abort injection
- Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
LoongArch:
- Add iocsr and mmio bus simulation in kernel.
- Add in-kernel interrupt controller emulation.
- Add support for virtualization extensions to the eiointc irqchip.
PPC:
- Drop lingering and utterly obsolete references to PPC970 KVM, which
was removed 10 years ago.
- Fix incorrect documentation references to non-existing ioctls
RISC-V:
- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
s390:
- New selftests: more ucontrol selftests and CPU model sanity checks
- Support for the gen17 CPU model
- List registers supported by KVM_GET/SET_ONE_REG in the
documentation
x86:
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
improve documentation, harden against unexpected changes.
Even if the hardware A/D tracking is disabled, it is possible to
use the hardware-defined A/D bits to track if a PFN is Accessed
and/or Dirty, and that removes a lot of special cases.
- Elide TLB flushes when aging secondary PTEs, as has been done in
x86's primary MMU for over 10 years.
- Recover huge pages in-place in the TDP MMU when dirty page logging
is toggled off, instead of zapping them and waiting until the page
is re-accessed to create a huge mapping. This reduces vCPU jitter.
- Batch TLB flushes when dirty page logging is toggled off. This
reduces the time it takes to disable dirty logging by ~3x.
- Remove the shrinker that was (poorly) attempting to reclaim shadow
page tables in low-memory situations.
- Clean up and optimize KVM's handling of writes to
MSR_IA32_APICBASE.
- Advertise CPUIDs for new instructions in Clearwater Forest
- Quirk KVM's misguided behavior of initialized certain feature MSRs
to their maximum supported feature set, which can result in KVM
creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
a non-zero value results in the vCPU having invalid state if
userspace hides PDCM from the guest, which in turn can lead to
save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support
LA57 to better follow the "architecture", in quotes because the
actual behavior is poorly documented. E.g. most MSR writes and
descriptor table loads ignore CR4.LA57 and operate purely on
whether the CPU supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(),
as filling the cache from IRQ context is generally unsafe; harden
the cache accessors to try to prevent similar issues from occuring
in the future. The issue that triggered this change was already
fixed in 6.12, but was still kinda latent.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where
KVM over-advertises SPEC_CTRL when trying to support cross-vendor
VMs.
- Minor cleanups
- Switch hugepage recovery thread to use vhost_task.
These kthreads can consume significant amounts of CPU time on
behalf of a VM or in response to how the VM behaves (for example
how it accesses its memory); therefore KVM tried to place the
thread in the VM's cgroups and charge the CPU time consumed by that
work to the VM's container.
However the kthreads did not process SIGSTOP/SIGCONT, and therefore
cgroups which had KVM instances inside could not complete freezing.
Fix this by replacing the kthread with a PF_USER_WORKER thread, via
the vhost_task abstraction. Another 100+ lines removed, with
generally better behavior too like having these threads properly
parented in the process tree.
- Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
didn't really work; there was really nothing to work around anyway:
the broken patch was meant to fix nested virtualization, but the
PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
erratum.
- Fix 6.12 regression where CONFIG_KVM will be built as a module even
if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
'y'.
x86 selftests:
- x86 selftests can now use AVX.
Documentation:
- Use rST internal links
- Reorganize the introduction to the API document
Generic:
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
instead of RCU, so that running a vCPU on a different task doesn't
encounter long due to having to wait for all CPUs become quiescent.
In general both reads and writes are rare, but userspace that
supports confidential computing is introducing the use of "helper"
vCPUs that may jump from one host processor to another. Those will
be very happy to trigger a synchronize_rcu(), and the effect on
performance is quite the disaster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
KVM: x86: add back X86_LOCAL_APIC dependency
Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
KVM: x86: switch hugepage recovery thread to vhost_task
KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
Documentation: KVM: fix malformed table
irqchip/loongson-eiointc: Add virt extension support
LoongArch: KVM: Add irqfd support
LoongArch: KVM: Add PCHPIC user mode read and write functions
LoongArch: KVM: Add PCHPIC read and write functions
LoongArch: KVM: Add PCHPIC device support
LoongArch: KVM: Add EIOINTC user mode read and write functions
LoongArch: KVM: Add EIOINTC read and write functions
LoongArch: KVM: Add EIOINTC device support
LoongArch: KVM: Add IPI user mode read and write function
LoongArch: KVM: Add IPI read and write function
LoongArch: KVM: Add IPI device support
LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
KVM: arm64: Pass on SVE mapping failures
...
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- second part of the ucontrol selftest
- cpumodel sanity check selftest
- gen17 cpumodel changes
|
|
Message-security-assist 11 introduces pckmo subfunctions to encrypt
hmac keys.
Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20241107152319.77816-3-brueckner@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20241107152319.77816-3-brueckner@linux.ibm.com>
|
|
The program interrupt code has some extra bits that are sometimes set
by hardware for various reasons; those bits should be ignored when the
program interrupt number is needed for interrupt handling.
Fixes: 05066cafa925 ("s390/mm/fault: Handle guest-related program interrupts in KVM")
Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20241031120316.25462-1-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Stop using gmap_enable(), gmap_disable(), gmap_get_enabled().
The correct guest ASCE is passed as a parameter of sie64a(), there is
no need to save the current gmap in lowcore.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-7-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Any program interrupt that happens in the host during the execution of
a KVM guest will now short circuit the fault handler and return to KVM
immediately. Guest fault handling (including pfault) will happen
entirely inside KVM.
When sie64a() returns zero, current->thread.gmap_int_code will contain
the program interrupt number that caused the exit, or zero if the exit
was not caused by a host program interrupt.
KVM will now take care of handling all guest faults in vcpu_post_run().
Since gmap faults will not be visible by the rest of the kernel, remove
GMAP_FAULT, the linux fault handlers for secure execution faults, the
exception table entries for the sie instruction, the nop padding after
the sie instruction, and all other references to guest faults from the
s390 code.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-6-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Use kvm_release_page_dirty() when unpinning guest pages, as the pfn was
retrieved via pin_guest_page(), i.e. is guaranteed to be backed by struct
page memory. This will allow dropping kvm_release_pfn_dirty() and
friends.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-81-seanjc@google.com>
|
|
Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just
a leftover bit of weirdness from days of old when KVM stuffed a "bad" page
into the guest instead of actually handling missing pages. See commit
cea7bb21280e ("KVM: MMU: Make gfn_to_page() always safe").
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-2-seanjc@google.com>
|
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
- Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogenous hardware
- Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
of the protocol
- FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
- New command-line parameter to control the WFx trap behavior under
KVM
- Introduce kCFI hardening in the EL2 hypervisor
- Fixes + cleanups for handling presence/absence of FEAT_TCRX
- Miscellaneous fixes + documentation updates
LoongArch:
- Add paravirt steal time support
- Add support for KVM_DIRTY_LOG_INITIALLY_SET
- Add perf kvm-stat support for loongarch
RISC-V:
- Redirect AMO load/store access fault traps to guest
- perf kvm stat support
- Use guest files for IMSIC virtualization, when available
s390:
- Assortment of tiny fixes which are not time critical
x86:
- Fixes for Xen emulation
- Add a global struct to consolidate tracking of host values, e.g.
EFER
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
effective APIC bus frequency, because TDX
- Print the name of the APICv/AVIC inhibits in the relevant
tracepoint
- Clean up KVM's handling of vendor specific emulation to
consistently act on "compatible with Intel/AMD", versus checking
for a specific vendor
- Drop MTRR virtualization, and instead always honor guest PAT on
CPUs that support self-snoop
- Update to the newfangled Intel CPU FMS infrastructure
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
it reads '0' and writes from userspace are ignored
- Misc cleanups
x86 - MMU:
- Small cleanups, renames and refactoring extracted from the upcoming
Intel TDX support
- Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
that can't hold leafs SPTEs
- Unconditionally drop mmu_lock when allocating TDP MMU page tables
for eager page splitting, to avoid stalling vCPUs when splitting
huge pages
- Bug the VM instead of simply warning if KVM tries to split a SPTE
that is non-present or not-huge. KVM is guaranteed to end up in a
broken state because the callers fully expect a valid SPTE, it's
all but dangerous to let more MMU changes happen afterwards
x86 - AMD:
- Make per-CPU save_area allocations NUMA-aware
- Force sev_es_host_save_area() to be inlined to avoid calling into
an instrumentable function from noinstr code
- Base support for running SEV-SNP guests. API-wise, this includes a
new KVM_X86_SNP_VM type, encrypting/measure the initial image into
guest memory, and finalizing it before launching it. Internally,
there are some gmem/mmu hooks needed to prepare gmem-allocated
pages before mapping them into guest private memory ranges
This includes basic support for attestation guest requests, enough
to say that KVM supports the GHCB 2.0 specification
There is no support yet for loading into the firmware those signing
keys to be used for attestation requests, and therefore no need yet
for the host to provide certificate data for those keys.
To support fetching certificate data from userspace, a new KVM exit
type will be needed to handle fetching the certificate from
userspace.
An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
exit type to handle this was introduced in v1 of this patchset, but
is still being discussed by community, so for now this patchset
only implements a stub version of SNP Extended Guest Requests that
does not provide certificate data
x86 - Intel:
- Remove an unnecessary EPT TLB flush when enabling hardware
- Fix a series of bugs that cause KVM to fail to detect nested
pending posted interrupts as valid wake eents for a vCPU executing
HLT in L2 (with HLT-exiting disable by L1)
- KVM: x86: Suppress MMIO that is triggered during task switch
emulation
Explicitly suppress userspace emulated MMIO exits that are
triggered when emulating a task switch as KVM doesn't support
userspace MMIO during complex (multi-step) emulation
Silently ignoring the exit request can result in the
WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
for some other reason prior to purging mmio_needed
See commit 0dc902267cb3 ("KVM: x86: Suppress pending MMIO write
exits if emulator detects exception") for more details on KVM's
limitations with respect to emulated MMIO during complex emulator
flows
Generic:
- Rename the AS_UNMOVABLE flag that was introduced for KVM to
AS_INACCESSIBLE, because the special casing needed by these pages
is not due to just unmovability (and in fact they are only
unmovable because the CPU cannot access them)
- New ioctl to populate the KVM page tables in advance, which is
useful to mitigate KVM page faults during guest boot or after live
migration. The code will also be used by TDX, but (probably) not
through the ioctl
- Enable halt poll shrinking by default, as Intel found it to be a
clear win
- Setup empty IRQ routing when creating a VM to avoid having to
synchronize SRCU when creating a split IRQCHIP on x86
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with
a flag that arch code can use for hooking both sched_in() and
sched_out()
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace
detect bugs
- Mark a vCPU as preempted if and only if it's scheduled out while in
the KVM_RUN loop, e.g. to avoid marking it preempted and thus
writing guest memory when retrieving guest state during live
migration blackout
Selftests:
- Remove dead code in the memslot modification stress test
- Treat "branch instructions retired" as supported on all AMD Family
17h+ CPUs
- Print the guest pseudo-RNG seed only when it changes, to avoid
spamming the log for tests that create lots of VMs
- Make the PMU counters test less flaky when counting LLC cache
misses by doing CLFLUSH{OPT} in every loop iteration"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
crypto: ccp: Add the SNP_VLEK_LOAD command
KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
KVM: x86: Replace static_call_cond() with static_call()
KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
x86/sev: Move sev_guest.h into common SEV header
KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
KVM: x86: Suppress MMIO that is triggered during task switch emulation
KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
KVM: Document KVM_PRE_FAULT_MEMORY ioctl
mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
perf kvm: Add kvm-stat for loongarch64
LoongArch: KVM: Add PV steal time support in guest side
...
|
|
Pass the guest ASCE explicitly as parameter, instead of having sie64a()
take it from lowcore.
This removes hidden state from lowcore, and makes things look cleaner.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20240703155900.103783-2-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
It's possible that SIE exits for work that the host needs to perform
rather than something that is intended for the guest.
A Linux guest will ignore this intercept code since there is nothing
for it to do, but a more robust solution would rewind the PSW back to
the SIE instruction. This will transparently resume the guest once
the host completes its work, without the guest needing to process
what is effectively a NOP and re-issue SIE itself.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20240301204342.3217540-1-farman@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20240301204342.3217540-1-farman@linux.ibm.com>
|
|
The address of the crypto control block in the (shadow) SIE block is
absolute/physical.
Convert from virtual to physical when shadowing the guest's control
block during VSIE.
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20240429171512.879215-1-nsg@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
In order for SIE to interpretively execute STFLE, it requires the real
or absolute address of a facility-list control block.
Before writing the location into the shadow SIE control block, convert
it from a virtual address.
We currently do not run into this bug because the lower 31 bits are the
same for virtual and physical addresses.
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Fix error handling in Processor Activity Instrumentation device
driver, and export number of counters with a sysfs file
- Allow for multiple events when Processor Activity Instrumentation
counters are monitored in system wide sampling
- Change multiplier and shift values of the Time-of-Day clock source to
improve steering precision
- Remove a couple of unneeded GFP_DMA flags from allocations
- Disable mmap alignment if randomize_va_space is also disabled, to
avoid a too small heap
- Various changes to allow s390 to be compiled with LLVM=1, since
ld.lld and llvm-objcopy will have proper s390 support witch clang 19
- Add __uninitialized macro to Compiler Attributes. This is helpful
with s390's FPU code where some users have up to 520 byte stack
frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
intention (performance improvement) of such code sections.
- Convert switch_to() to an out-of-line function, and use the generic
switch_to header file
- Replace the usage of s390's debug feature with pr_debug() calls
within the zcrypt device driver
- Improve hotplug support of the Adjunct Processor device driver
- Improve retry handling in the zcrypt device driver
- Various changes to the in-kernel FPU code:
- Make in-kernel FPU sections preemptible
- Convert various larger inline assemblies and assembler files to
C, mainly by using singe instruction inline assemblies. This
increases readability, but also allows makes it easier to add
proper instrumentation hooks
- Cleanup of the header files
- Provide fast variants of csum_partial() and
csum_partial_copy_nocheck() based on vector instructions
- Introduce and use a lock to synchronize accesses to zpci device data
structures to avoid inconsistent states caused by concurrent accesses
- Compile the kernel without -fPIE. This addresses the following
problems if the kernel is compiled with -fPIE:
- It uses dynamic symbols (.dynsym), for which the linker refuses
to allow more than 64k sections. This can break features which
use '-ffunction-sections' and '-fdata-sections', including
kpatch-build and function granular KASLR
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses
- Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
reported as globally shared
* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
s390/cache: prevent rebuild of shared_cpu_list
s390/crypto: remove retry loop with sleep from PAES pkey invocation
s390/pkey: improve pkey retry behavior
s390/zcrypt: improve zcrypt retry behavior
s390/zcrypt: introduce retries on in-kernel send CPRB functions
s390/ap: introduce mutex to lock the AP bus scan
s390/ap: rework ap_scan_bus() to return true on config change
s390/ap: clarify AP scan bus related functions and variables
s390/ap: rearm APQNs bindings complete completion
s390/configs: increase number of LOCKDEP_BITS
s390/vfio-ap: handle hardware checkstop state on queue reset operation
s390/pai: change sampling event assignment for PMU device driver
s390/boot: fix minor comment style damages
s390/boot: do not check for zero-termination relocation entry
s390/boot: make type of __vmlinux_relocs_64_start|end consistent
s390/boot: sanitize kaslr_adjust_relocs() function prototype
s390/boot: simplify GOT handling
s390: vmlinux.lds.S: fix .got.plt assertion
s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
...
|
|
KVM modifies the kernel fpu's regs pointer to its own area to implement its
custom version of preemtible kernel fpu context. With general support for
preemptible kernel fpu context there is no need for the extra complexity in
KVM code anymore.
Therefore convert KVM to a regular kernel fpu user. In particular this
means that all TIF_FPU checks can be removed, since the fpu register
context will never be changed by other kernel fpu users, and also the fpu
register context will be restored if a thread is preempted.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rename save_fpu_regs(), load_fpu_regs(), and struct thread_struct's fpu
member to save_user_fpu_regs(), load_user_fpu_regs(), and ufpu. This way
the function and variable names reflect for which context they are supposed
to be used.
This large and trivial conversion is a prerequisite for making the kernel
fpu usage preemptible.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The FPU state, as represented by the CIF_FPU flag reflects the FPU state of
a task, not the CPU it is running on. Therefore convert the flag to a
regular TIF flag.
This removes the magic in switch_to() where a save_fpu_regs() call for the
currently (previous) running task sets the per-cpu CIF_FPU flag, which is
required to restore FPU register contents of the next task, when it returns
to user space.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Move, rename, and merge the fpu and vx header files. This way fpu header
files have a consistent naming scheme (fpu*.h).
Also get rid of the fpu subdirectory and move header files to asm
directory, so that all fpu and vx header files can be found at the same
location.
Merge internal.h header file into other header files, since the internal
helpers are used at many locations. so those helper functions are really
not internal.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
pqap instruction missing cc fix
vsie shadow creation race fix
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- uvdevice fixed additional data return length
- stfle (feature indication) vsie fixes and minor cleanup
|
|
The length of the facility list accessed when interpretively executing
STFLE is the same as the hosts facility list (in case of format-0)
The memory following the facility list doesn't need to be accessible.
The current VSIE implementation accesses a fixed length that exceeds the
guest/host facility list length and can therefore wrongly inject a
validity intercept.
Instead, find out the host facility list length by running STFLE and
copy only as much as necessary when shadowing.
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-3-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-3-nsg@linux.ibm.com>
|
|
STFLE can be interpretively executed.
This occurs when the facility list designation is unequal to zero.
Perform the check before applying the address mask instead of after.
Fixes: 66b630d5b7f2 ("KVM: s390: vsie: support STFLE interpretation")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-2-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-2-nsg@linux.ibm.com>
|
|
Right now it is possible to see gmap->private being zero in
kvm_s390_vsie_gmap_notifier resulting in a crash. This is due to the
fact that we add gmap->private == kvm after creation:
static int acquire_gmap_shadow(struct kvm_vcpu *vcpu,
struct vsie_page *vsie_page)
{
[...]
gmap = gmap_shadow(vcpu->arch.gmap, asce, edat);
if (IS_ERR(gmap))
return PTR_ERR(gmap);
gmap->private = vcpu->kvm;
Let children inherit the private field of the parent.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
Cc: <stable@vger.kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20231220125317.4258-1-borntraeger@linux.ibm.com
|
|
When the host invalidates a guest page, it will also check if the page
was used to map the prefix of any guest CPUs, in which case they are
stopped and marked as needing a prefix refresh. Upon starting the
affected CPUs again, their prefix pages are explicitly faulted in and
revalidated if they had been invalidated. A bit in the PGSTEs indicates
whether or not a page might contain a prefix. The bit is allowed to
overindicate. Pages above 2G are skipped, because they cannot be
prefixes, since KVM runs all guests with MSO = 0.
The same applies for nested guests (VSIE). When the host invalidates a
guest page that maps the prefix of the nested guest, it has to stop the
affected nested guest CPUs and mark them as needing a prefix refresh.
The same PGSTE bit used for the guest prefix is also used for the
nested guest. Pages above 2G are skipped like for normal guests, which
is the source of the bug.
The nested guest runs is the guest primary address space. The guest
could be running the nested guest using MSO != 0. If the MSO + prefix
for the nested guest is above 2G, the check for nested prefix will skip
it. This will cause the invalidation notifier to not stop the CPUs of
the nested guest and not mark them as needing refresh. When the nested
guest is run again, its prefix will not be refreshed, since it has not
been marked for refresh. This will cause a fatal validity intercept
with VIR code 37.
Fix this by removing the check for 2G for nested guests. Now all
invalidations of pages with the notify bit set will always scan the
existing VSIE shadow state descriptors.
This allows to catch invalidations of nested guest prefix mappings even
when the prefix is above 2G in the guest virtual address space.
Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
Tested-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20231102153549.53984-1-imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
The shadow gmap tracks memory of nested guests (guest-3). In certain
scenarios, the shadow gmap needs to be rebuilt, which is a costly operation
since it involves a SIE exit into guest-1 for every entry in the respective
shadow level.
Add kvm stat counters when new shadow structures are created at various
levels. Also add a counter gmap_shadow_create when a completely fresh
shadow gmap is created as well as a counter gmap_shadow_reuse when an
existing gmap is being reused.
Note that when several levels are shadowed at once, counters on all
affected levels will be increased.
Also note that not all page table levels need to be present and a ASCE
can directly point to e.g. a segment table. In this case, a new segment
table will always be equivalent to a new shadow gmap and hence will be
counted as gmap_shadow_create and not as gmap_shadow_segment.
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231009093304.2555344-2-nrb@linux.ibm.com
Message-Id: <20231009093304.2555344-2-nrb@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:
- Fix virtual vs physical address confusion in vmem_add_range() and
vmem_remove_range() functions
- Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
throughout s390 code
- Make all PSW related defines also available for assembler files.
Remove PSW_DEFAULT_KEY define from uapi for that
- When adding an undefined symbol the build still succeeds, but
userspace crashes trying to execute VDSO, because the symbol is not
resolved. Add undefined symbols check to prevent that
- Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
memory when executing s390 crypto adapter IOCTL
- Add -fPIE flag to prevent decompressor misaligned symbol build error
with clang
- Use .balign instead of .align everywhere. This is a no-op for s390,
but with this there no mix in using .align and .balign anymore
- Filter out -mno-pic-data-is-text-relative flag when compiling kernel
to prevent VDSO build error
- Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
mask directly
- Do not retry administrative requests to some s390 crypto cards, since
the firmware assumes replay attacks
- Remove most of the debug code, which is build in when kernel config
option CONFIG_ZCRYPT_DEBUG is enabled
- Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
off the multiple devices support for the s390 zcrypt device driver
- With the conversion to generic entry machine checks are accounted to
the current context instead of irq time. As result, the STCKF
instruction at the beginning of the machine check handler and the
lowcore member are no longer required, therefore remove it
- Fix various typos found with codespell
- Minor cleanups to CPU-measurement Counter and Sampling Facilities
code
- Revert patch that removes VMEM_MAX_PHYS macro, since it causes a
regression
* tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits)
Revert "s390/mm: get rid of VMEM_MAX_PHYS macro"
s390/cpum_sf: remove check on CPU being online
s390/cpum_sf: handle casts consistently
s390/cpum_sf: remove unnecessary debug statement
s390/cpum_sf: remove parameter in call to pr_err
s390/cpum_sf: simplify function setup_pmu_cpu
s390/cpum_cf: remove unneeded debug statements
s390/entry: remove mcck clock
s390: fix various typos
s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option
s390/zcrypt: do not retry administrative requests
s390/zcrypt: cleanup some debug code
s390/entry: rework entering DAT-on mode on CPU restart
s390/mm: fence off VM macros from asm and linker
s390: include linux/io.h instead of asm/io.h
s390/ptrace: make all psw related defines also available for asm
s390/ptrace: remove PSW_DEFAULT_KEY from uapi
s390/vdso: filter out mno-pic-data-is-text-relative cflag
s390: consistently use .balign instead of .align
s390/decompressor: fix misaligned symbol build error
...
|
|
Fix various typos found with codespell.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
bit_and() uses the count of bits as the woking length.
Fix the previous implementation and effectively use
the right bitmap size.
Fixes: 19fd83a64718 ("KVM: s390: vsie: allow CRYCB FORMAT-1")
Fixes: 56019f9aca22 ("KVM: s390: vsie: Allow CRYCB FORMAT-2")
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/kvm/20230511094719.9691-1-pmorel@linux.ibm.com/
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
|
|
The APCB is part of the CRYCB.
The calculation of the APCB origin can be done by adding
the APCB offset to the CRYCB origin.
Current code makes confusing transformations, converting
the CRYCB origin to a pointer to calculate the APCB origin.
Let's make things simpler and keep the CRYCB origin to make
these calculations.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230214122841.13066-2-pmorel@linux.ibm.com
Message-Id: <20230214122841.13066-2-pmorel@linux.ibm.com>
|
|
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a97d: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that tiggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
|
|
We recently experienced some weird huge time jumps in nested guests when
rebooting them in certain cases. After adding some debug code to the epoch
handling in vsie.c (thanks to David Hildenbrand for the idea!), it was
obvious that the "epdx" field (the multi-epoch extension) did not get set
to 0xff in case the "epoch" field was negative.
Seems like the code misses to copy the value from the epdx field from
the guest to the shadow control block. By doing so, the weird time
jumps are gone in our scenarios.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2140899
Fixes: 8fa1696ea781 ("KVM: s390: Multiple Epoch Facility support")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Cc: stable@vger.kernel.org # 4.19+
Link: https://lore.kernel.org/r/20221123090833.292938-1-thuth@redhat.com
Message-Id: <20221123090833.292938-1-thuth@redhat.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
|
|
pin_guest_page() used page_to_virt() to calculate the hpa of the pinned
page. This currently works, because virtual and physical addresses are
the same. Use page_to_phys() instead to resolve the virtual-real address
confusion.
One caller of pin_guest_page() actually expected the hpa to be a hva, so
add the missing phys_to_virt() conversion here.
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20221025082039.117372-2-nrb@linux.ibm.com
Message-Id: <20221025082039.117372-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
|
|
We report a topology change to the guest for any CPU hotplug.
The reporting to the guest is done using the Multiprocessor
Topology-Change-Report (MTCR) bit of the utility entry in the guest's
SCA which will be cleared during the interpretation of PTF.
On every vCPU creation we set the MCTR bit to let the guest know the
next time it uses the PTF with command 2 instruction that the
topology changed and that it should use the STSI(15.1.x) instruction
to get the topology details.
STSI(15.1.x) gives information on the CPU configuration topology.
Let's accept the interception of STSI with the function code 15 and
let the userland part of the hypervisor handle it when userland
supports the CPU Topology facility.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20220714101824.101601-2-pmorel@linux.ibm.com
Message-Id: <20220714101824.101601-2-pmorel@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
|
|
Add wrappers to acquire/release KVM's SRCU lock when stashing the index
in vcpu->src_idx, along with rudimentary detection of illegal usage,
e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the
SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
unnoticed for quite some time and only cause problems when the nested
lock happens to get a different index.
Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
likely yell so loudly that it will bring the kernel to its knees.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20220415004343.2203171-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When this feature is enabled the hardware is free to interpret
specification exceptions generated by the guest, instead of causing
program interruption interceptions.
This benefits (test) programs that generate a lot of specification
exceptions (roughly 4x increase in exceptions/sec).
Interceptions will occur as before if ICTL_PINT is set,
i.e. if guest debug is enabled.
There is no indication if this feature is available or not and the
hardware is free to interpret or not. So we can simply set this bit and
if the hardware ignores it we fall back to intercept 8 handling.
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Link: https://lore.kernel.org/linux-s390/20210706114714.3936825-1-scgl@linux.ibm.com/
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Prefixing needs to be applied to the guest real address to translate it
into a guest absolute address.
The value of MSO needs to be added to a guest-absolute address in order to
obtain the host-virtual.
Fixes: bdf7509bbefa ("s390/kvm: VSIE: correctly handle MVPG when in VSIE")
Reported-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210322140559.500716-3-imbrenda@linux.ibm.com
[borntraeger@de.ibm.com simplify mso]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Correctly handle the MVPG instruction when issued by a VSIE guest.
Fixes: a3508fbe9dc6d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: stable@vger.kernel.org # f85f1baaa189: KVM: s390: split kvm_s390_logical_to_effective
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20210302174443.514363-4-imbrenda@linux.ibm.com
[borntraeger@de.ibm.com: apply fixup from Claudio]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Extend kvm_s390_shadow_fault to return the pointer to the valid leaf
DAT table entry, or to the invalid entry.
Also return some flags in the lower bits of the address:
PEI_DAT_PROT: indicates that DAT protection applies because of the
protection bit in the segment (or, if EDAT, region) tables.
PEI_NOT_PTE: indicates that the address of the DAT table entry returned
does not refer to a PTE, but to a segment or region table.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: stable@vger.kernel.org
Reviewed-by: Janosch Frank <frankja@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20210302174443.514363-3-imbrenda@linux.ibm.com
[borntraeger@de.ibm.com: fold in a fix from Claudio]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
This patch converts s390 to use the generic entry infrastructure from
kernel/entry/*.
There are a few special things on s390:
- PIF_PER_TRAP is moved to TIF_PER_TRAP as the generic code doesn't
know about our PIF flags in exit_to_user_mode_loop().
- The old code had several ways to restart syscalls:
a) PIF_SYSCALL_RESTART, which was only set during execve to force a
restart after upgrading a process (usually qemu-kvm) to pgste page
table extensions.
b) PIF_SYSCALL, which is set by do_signal() to indicate that the
current syscall should be restarted. This is changed so that
do_signal() now also uses PIF_SYSCALL_RESTART. Continuing to use
PIF_SYSCALL doesn't work with the generic code, and changing it
to PIF_SYSCALL_RESTART makes PIF_SYSCALL and PIF_SYSCALL_RESTART
more unique.
- On s390 calling sys_sigreturn or sys_rt_sigreturn is implemented by
executing a svc instruction on the process stack which causes a fault.
While handling that fault the fault code sets PIF_SYSCALL to hand over
processing to the syscall code on exit to usermode.
The patch introduces PIF_SYSCALL_RET_SET, which is set if ptrace sets
a return value for a syscall. The s390x ptrace ABI uses r2 both for the
syscall number and return value, so ptrace cannot set the syscall number +
return value at the same time. The flag makes handling that a bit easier.
do_syscall() will just skip executing the syscall if PIF_SYSCALL_RET_SET
is set.
CONFIG_DEBUG_ASCE was removd in favour of the generic CONFIG_DEBUG_ENTRY.
CR1/7/13 will be checked both on kernel entry and exit to contain the
correct asces.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Almost all kvm allocations in the s390x KVM code can be attributed to
the process that triggers the allocation (in other words, no global
allocation for other guests). This will help the memcg controller to
make the right decisions.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
|
|
DIAGNOSE 0x318 (diag318) sets information regarding the environment
the VM is running in (Linux, z/VM, etc) and is observed via
firmware/service events.
This is a privileged s390x instruction that must be intercepted by
SIE. Userspace handles the instruction as well as migration. Data
is communicated via VCPU register synchronization.
The Control Program Name Code (CPNC) is stored in the SIE block. The
CPNC along with the Control Program Version Code (CPVC) are stored
in the kvm_vcpu_arch struct.
This data is reset on load normal and clear resets.
Signed-off-by: Collin Walling <walling@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200622154636.5499-3-walling@linux.ibm.com
[borntraeger@de.ibm.com: fix sync_reg position]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Add support for multi-function devices in pci code.
- Enable PF-VF linking for architectures using the pdev->no_vf_scan
flag (currently just s390).
- Add reipl from NVMe support.
- Get rid of critical section cleanup in entry.S.
- Refactor PNSO CHSC (perform network subchannel operation) in cio and
qeth.
- QDIO interrupts and error handling fixes and improvements, more
refactoring changes.
- Align ioremap() with generic code.
- Accept requests without the prefetch bit set in vfio-ccw.
- Enable path handling via two new regions in vfio-ccw.
- Other small fixes and improvements all over the code.
* tag 's390-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (52 commits)
vfio-ccw: make vfio_ccw_regops variables declarations static
vfio-ccw: Add trace for CRW event
vfio-ccw: Wire up the CRW irq and CRW region
vfio-ccw: Introduce a new CRW region
vfio-ccw: Refactor IRQ handlers
vfio-ccw: Introduce a new schib region
vfio-ccw: Refactor the unregister of the async regions
vfio-ccw: Register a chp_event callback for vfio-ccw
vfio-ccw: Introduce new helper functions to free/destroy regions
vfio-ccw: document possible errors
vfio-ccw: Enable transparent CCW IPL from DASD
s390/pci: Log new handle in clp_disable_fh()
s390/cio, s390/qeth: cleanup PNSO CHSC
s390/qdio: remove q->first_to_kick
s390/qdio: fix up qdio_start_irq() kerneldoc
s390: remove critical section cleanup from entry.S
s390: add machine check SIGP
s390/pci: ioremap() align with generic code
s390/ap: introduce new ap function ap_get_qdev()
Documentation/s390: Update / remove developerWorks web links
...
|
|
The current code is rather complex and caused a lot of subtle
and hard to debug bugs in the past. Simplify the code by calling
the system_call handler with interrupts disabled, save
machine state, and re-enable them later.
This requires significant changes to the machine check handling code
as well. When the machine check interrupt arrived while being in kernel
mode the new code will signal pending machine checks with a SIGP external
call. When userspace was interrupted, the handler will switch to the
kernel stack and directly execute s390_handle_mcck().
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Let's move it to the outer loop, in case we ever run again into long
loops, trying to map the prefix. While at it, convert it to cond_resched().
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200403153050.20569-5-david@redhat.com
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|