Age | Commit message (Collapse) | Author |
|
KVM/riscv changes for 5.19
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.19
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed
to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
from 4 to 6. - Paolo]
|
|
In commit ec0671d5684a ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
was moved after guest_exit_irqoff() because invoking tracepoints within
kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
These days this is not necessary, because commit 87fa7f3e98a1 ("x86/kvm:
Move context tracking where it belongs", 2020-07-09) restricted
the RCU extended quiescent state to be closer to vmentry/vmexit.
Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
because it will be reported even if vcpu_enter_guest causes multiple
vmentries via the IPI/Timer fast paths, and it allows the removal of
advance_expire_delta.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Currently, there is no provision for vmm (qemu-kvm or kvmtool) to
query about multiple-letter ISA extensions. The config register
is only used for base single letter ISA extensions.
A new ISA extension register is added that will allow the vmm
to query about any ISA extension one at a time. It is enabled for
both single letter or multi-letter ISA extensions. The ISA extension
register is useful to if the vmm requires to retrieve/set single
extension while the config register should be used if all the base
ISA extension required to retrieve or set.
For any multi-letter ISA extensions, the new register interface
must be used.
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
On RISC-V platforms with hardware VMID support, we share same
VMID for all VCPUs of a particular Guest/VM. This means we might
have stale G-stage TLB entries on the current Host CPU due to
some other VCPU of the same Guest which ran previously on the
current Host CPU.
To cleanup stale TLB entries, we simply flush all G-stage TLB
entries by VMID whenever underlying Host CPU changes for a VCPU.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
The generic KVM has support for VCPU requests which can be used
to do arch-specific work in the run-loop. We introduce remote
HFENCE functions which will internally use VCPU requests instead
of host SBI calls.
Advantages of doing remote HFENCEs as VCPU requests are:
1) Multiple VCPUs of a Guest may be running on different Host CPUs
so it is not always possible to determine the Host CPU mask for
doing Host SBI call. For example, when VCPU X wants to do HFENCE
on VCPU Y, it is possible that VCPU Y is blocked or in user-space
(i.e. vcpu->cpu < 0).
2) To support nested virtualization, we will be having a separate
shadow G-stage for each VCPU and a common host G-stage for the
entire Guest/VM. The VCPU requests based remote HFENCEs helps
us easily synchronize the common host G-stage and shadow G-stage
of each VCPU without any additional IPI calls.
This is also a preparatory patch for upcoming nested virtualization
support where we will be having a shadow G-stage page table for
each Guest VCPU.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
Currently, the KVM_MAX_VCPUS value is 16384 for RV64 and 128
for RV32.
The KVM_MAX_VCPUS value is too high for RV64 and too low for
RV32 compared to other architectures (e.g. x86 sets it to 1024
and ARM64 sets it to 512). The too high value of KVM_MAX_VCPUS
on RV64 also leads to VCPU mask on stack consuming 2KB.
We set KVM_MAX_VCPUS to 1024 for both RV64 and RV32 to be
aligned other architectures.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
Various __kvm_riscv_hfence_xyz() functions implemented in the
kvm/tlb.S are equivalent to corresponding HFENCE.GVMA instructions
and we don't have range based local HFENCE functions.
This patch provides complete set of local HFENCE functions which
supports range based TLB invalidation and supports HFENCE.VVMA
based functions. This is also a preparatory patch for upcoming
Svinval support in KVM RISC-V.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
We should treat SBI HFENCE calls as NOPs until nested virtualization
is supported by KVM RISC-V. This will help us test booting a hypervisor
under KVM RISC-V.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
Latest QEMU supports G-stage Sv57x4 mode so this patch extends KVM
RISC-V G-stage handling to detect and use Sv57x4 mode when available.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
The two-stage address translation defined by the RISC-V privileged
specification defines: VS-stage (guest virtual address to guest
physical address) programmed by the Guest OS and G-stage (guest
physical addree to host physical address) programmed by the
hypervisor.
To align with above terminology, we replace "stage2" with "gstage"
and "Stage2" with "G-stage" name everywhere in KVM RISC-V sources.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
* kvm-arm64/its-save-restore-fixes-5.19:
: .
: Tighten the ITS save/restore infrastructure to fail early rather
: than late. Patches courtesy of Rocardo Koller.
: .
KVM: arm64: vgic: Undo work in failed ITS restores
KVM: arm64: vgic: Do not ignore vgic_its_restore_cte failures
KVM: arm64: vgic: Add more checks when restoring ITS tables
KVM: arm64: vgic: Check that new ITEs could be saved in guest memory
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/misc-5.19:
: .
: Misc fixes and general improvements for KVMM/arm64:
:
: - Better handle out of sequence sysregs in the global tables
:
: - Remove a couple of unnecessary loads from constant pool
:
: - Drop unnecessary pKVM checks
:
: - Add all known M1 implementations to the SEIS workaround
:
: - Cleanup kerneldoc warnings
: .
KVM: arm64: vgic-v3: List M1 Pro/Max as requiring the SEIS workaround
KVM: arm64: pkvm: Don't mask already zeroed FEAT_SVE
KVM: arm64: pkvm: Drop unnecessary FP/SIMD trap handler
KVM: arm64: nvhe: Eliminate kernel-doc warnings
KVM: arm64: Avoid unnecessary absolute addressing via literals
KVM: arm64: Print emulated register table name when it is unsorted
KVM: arm64: Don't BUG_ON() if emulated register table is unsorted
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/per-vcpu-host-pmu-data:
: .
: Pass the host PMU state in the vcpu to avoid the use of additional
: shared memory between EL1 and EL2 (this obviously only applies
: to nVHE and Protected setups).
:
: Patches courtesy of Fuad Tabba.
: .
KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected
KVM: arm64: Reenable pmu in Protected Mode
KVM: arm64: Pass pmu events to hyp via vcpu
KVM: arm64: Repack struct kvm_pmu to reduce size
KVM: arm64: Wrapper for getting pmu_events
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/vgic-invlpir:
: .
: Implement MMIO-based LPI invalidation for vGICv3.
: .
KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision
KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation
KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs
irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES}
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/psci-suspend:
: .
: Add support for PSCI SYSTEM_SUSPEND and allow userspace to
: filter the wake-up events.
:
: Patches courtesy of Oliver.
: .
Documentation: KVM: Fix title level for PSCI_SUSPEND
selftests: KVM: Test SYSTEM_SUSPEND PSCI call
selftests: KVM: Refactor psci_test to make it amenable to new tests
selftests: KVM: Use KVM_SET_MP_STATE to power off vCPU in psci_test
selftests: KVM: Create helper for making SMCCC calls
selftests: KVM: Rename psci_cpu_on_test to psci_test
KVM: arm64: Implement PSCI SYSTEM_SUSPEND
KVM: arm64: Add support for userspace to suspend a vCPU
KVM: arm64: Return a value from check_vcpu_requests()
KVM: arm64: Rename the KVM_REQ_SLEEP handler
KVM: arm64: Track vCPU power state using MP state values
KVM: arm64: Dedupe vCPU power off helpers
KVM: arm64: Don't depend on fallthrough to hide SYSTEM_RESET2
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/hcall-selection:
: .
: Introduce a new set of virtual sysregs for userspace to
: select the hypercalls it wants to see exposed to the guest.
:
: Patches courtesy of Raghavendra and Oliver.
: .
KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run
KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace
Documentation: Fix index.rst after psci.rst renaming
selftests: KVM: aarch64: Add the bitmap firmware registers to get-reg-list
selftests: KVM: aarch64: Introduce hypercall ABI test
selftests: KVM: Create helper for making SMCCC calls
selftests: KVM: Rename psci_cpu_on_test to psci_test
tools: Import ARM SMCCC definitions
Docs: KVM: Add doc for the bitmap firmware registers
Docs: KVM: Rename psci.rst to hypercalls.rst
KVM: arm64: Add vendor hypervisor firmware register
KVM: arm64: Add standard hypervisor firmware register
KVM: arm64: Setup a framework for hypercall bitmap firmware registers
KVM: arm64: Factor out firmware register handling from psci.c
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We generally want to disallow hypercall bitmaps being changed
once vcpus have already run. But we must allow the write if
the written value is unchanged so that userspace can rewrite
the register file on reboot, for example.
Without this, a QEMU-based VM will fail to reboot correctly.
The original code was correct, and it is me that introduced
the regression.
Fixes: 05714cab7d63 ("KVM: arm64: Setup a framework for hypercall bitmap firmware registers")
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Failed ITS restores should clean up all state restored until the
failure. There is some cleanup already present when failing to restore
some tables, but it's not complete. Add the missing cleanup.
Note that this changes the behavior in case of a failed restore of the
device tables.
restore ioctl:
1. restore collection tables
2. restore device tables
With this commit, failures in 2. clean up everything created so far,
including state created by 1.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510001633.552496-5-ricarkol@google.com
|
|
Restoring a corrupted collection entry (like an out of range ID) is
being ignored and treated as success. More specifically, a
vgic_its_restore_cte failure is treated as success by
vgic_its_restore_collection_table. vgic_its_restore_cte uses positive
and negative numbers to return error, and +1 to return success. The
caller then uses "ret > 0" to check for success.
Fix this by having vgic_its_restore_cte only return negative numbers on
error. Do this by changing alloc_collection return codes to only return
negative numbers on error.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510001633.552496-4-ricarkol@google.com
|
|
Try to improve the predictability of ITS save/restores (and debuggability
of failed ITS saves) by failing early on restore when trying to read
corrupted tables.
Restoring the ITS tables does some checks for corrupted tables, but not as
many as in a save: an overflowing device ID will be detected on save but
not on restore. The consequence is that restoring a corrupted table won't
be detected until the next save; including the ITS not working as expected
after the restore. As an example, if the guest sets tables overlapping
each other, which would most likely result in some corrupted table, this is
what we would see from the host point of view:
guest sets base addresses that overlap each other
save ioctl
restore ioctl
save ioctl (fails)
Ideally, we would like the first save to fail, but overlapping tables could
actually be intended by the guest. So, let's at least fail on the restore
with some checks: like checking that device and event IDs don't overflow
their tables.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510001633.552496-3-ricarkol@google.com
|
|
Try to improve the predictability of ITS save/restores by failing
commands that would lead to failed saves. More specifically, fail any
command that adds an entry into an ITS table that is not in guest
memory, which would otherwise lead to a failed ITS save ioctl. There
are already checks for collection and device entries, but not for
ITEs. Add the corresponding check for the ITT when adding ITEs.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510001633.552496-2-ricarkol@google.com
|
|
Moving kvm_pmu_events into the vcpu (and refering to it) broke the
somewhat unusual case where the kernel has no support for a PMU
at all.
In order to solve this, move things around a bit so that we can
easily avoid refering to the pmu structure outside of PMU-aware
code. As a bonus, pmu.c isn't compiled in when HW_PERF_EVENTS
isn't selected.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/202205161814.KQHpOzsJ-lkp@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
- Fix KVM PR on 32-bit, which was broken by some MMU code refactoring.
Thanks to: Alexander Graf, and Matt Evans.
* tag 'powerpc-5.18-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S PR: Enable MSR_DR for switch_mmu_context()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for the handling of unpopulated sub-pmd spaces.
The copy & pasta from the corresponding s390 code screwed up the
address calculation for marking the sub-pmd ranges via memset by
omitting the ALIGN_DOWN() to calculate the proper start address.
It's a mystery why this code is not generic and shared because there
is nothing architecture specific in there, but that's too intrusive
for a backportable fix"
* tag 'x86-urgent-2022-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix marking of unused sub-pmd ranges
|
|
These constants will change over time, and userspace has no
business knowing about them. Hide them behind __KERNEL__.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Now that the pmu code does not access hyp data, reenable it in
protected mode.
Once fully supported, protected VMs will not have pmu support,
since that could leak information. However, non-protected VMs in
protected mode should have pmu support if available.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510095710.148178-5-tabba@google.com
|
|
Instead of the host accessing hyp data directly, pass the pmu
events of the current cpu to hyp via the vcpu.
This adds 64 bits (in two fields) to the vcpu that need to be
synced before every vcpu run in nvhe and protected modes.
However, it isolates the hypervisor from the host, which allows
us to use pmu in protected mode in a subsequent patch.
No visible side effects in behavior intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510095710.148178-4-tabba@google.com
|
|
Eases migrating away from using hyp data and simplifies the code.
No functional change intended.
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510095710.148178-2-tabba@google.com
|
|
Unsusprisingly, Apple M1 Pro/Max have the exact same defect as the
original M1 and generate random SErrors in the host when a guest
tickles the GICv3 CPU interface the wrong way.
Add the part numbers for both the CPU types found in these two
new implementations, and add them to the hall of shame. This also
applies to the Ultra version, as it is composed of 2 Max SoCs.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220514102524.3188730-1-maz@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Seven MM fixes, three of which address issues added in the most recent
merge window, four of which are cc:stable.
Three non-MM fixes, none very serious"
[ And yes, that's a real pull request from Andrew, not me creating a
branch from emailed patches. Woo-hoo! ]
* tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add a mailing list for DAMON development
selftests: vm: Makefile: rename TARGETS to VMTARGETS
mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
mailmap: add entry for martyna.szapar-mudlaw@intel.com
arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
procfs: prevent unprivileged processes accessing fdinfo dir
mm: mremap: fix sign for EFAULT error return value
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
mm/huge_memory: do not overkill when splitting huge_zero_page
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
- TLB invalidation workaround for Qualcomm Kryo-4xx "gold" CPUs
- Fix broken dependency in the vDSO Makefile
- Fix pointer authentication overrides in ISAR2 ID register
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs
arm64: cpufeature: remove duplicate ID_AA64ISAR2_EL1 entry
arm64: vdso: fix makefile dependency on vdso.so
|
|
The unused part precedes the new range spanned by the start, end parameters
of vmemmap_use_new_sub_pmd(). This means it actually goes from
ALIGN_DOWN(start, PMD_SIZE) up to start.
Use the correct address when applying the mark using memset.
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220509090637.24152-2-ken@codelabs.ch
|
|
Avoid calling handlers on empty rmap entries and skip to the next non
empty rmap entry.
Empty rmap entries are noop in handlers.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220502220347.174664-1-vipinsh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to
include all MKTME KeyID bits to include them to shadow_zero_check.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of
high bits of physical address bits as 'KeyID' bits. Intel Trust Domain
Extentions (TDX) further steals part of MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set
any legacy MKTME KeyID bits to any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in SPTE which maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for SPTE which maps guest memory. MKTME KeyID bits should be set
to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
to shadow_zero_check).
Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:
- shadow_me_value: the memory encryption bit(s) that will be set to the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a super
set of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is a constant during KVM's life time. This
perhaps doesn't fit MKTME but for now host kernel doesn't support it
(and perhaps will never do).
- Bits in shadow_me_mask are set to shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed. This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
it clearer that it resets the reserved bits check for guest's page table
entries.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Expand and clean up the page fault stats. The current stats are at best
incomplete, and at worst misleading. Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.
Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
path for TDP page faults. The generated code is identical, and the #ifdef
makes it dangerously difficult to extend the logic (guess who forgot to
add an "else" inside the #ifdef and ran through the page fault handler
twice).
No functional or binary change intented.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs. This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing. Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.
E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.
More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
No functional change intended.
Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that
becomes redundant as a result. No sane hardware will generate an access
that is both an instruct fetch and a write, i.e. it's a waste of cycles.
If hardware goes off the rails, or KVM runs under a misguided hypervisor,
spuriously running throught fast path is benign (KVM has been uknowingly
being doing exactly that for years).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault vian the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Passing per_cpu() to list_for_each_entry() causes the macro to be
evaluated N+1 times for N sleeping vCPUs. This is a very small
inefficiency, and the code is cleaner if the address of the per-CPU
variable is loaded earlier. Do this for both the list and the spinlock.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This shows up as a TDP MMU leak when running nested. Non-working cmpxchg on L0
relies makes L1 install two different shadow pages under same spte, and one of
them is leaked.
Fixes: 1c2361f667f36 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add KRYO4XX gold/big cores to the list of CPUs that need the
repeat TLBI workaround. Apply this to the affected
KRYO4XX cores (rcpe to rfpe).
The variant and revision bits are implementation defined and are
different from the their Cortex CPU counterparts on which they are
based on, i.e., (r0p0 to r3p0) is equivalent to (rcpe to rfpe).
Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
Reviewed-by: Sai Prakash Ranjan <quic_saipraka@quicinc.com>
Link: https://lore.kernel.org/r/20220512110134.12179-1-quic_shrekk@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The ID register table should have one entry per ID register but
currently has two entries for ID_AA64ISAR2_EL1. Only one entry has an
override, and get_arm64_ftr_reg() can end up choosing the other, causing
the override to be ignored. Fix this by removing the duplicate entry.
While here, also make the check in sort_ftr_regs() more strict so that
duplicate entries can't be added in the future.
Fixes: def8c222f054 ("arm64: Add support of PAuth QARMA3 architected algorithm")
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220511162030.1403386-1-kristina.martsenko@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 863771a28e27 ("powerpc/32s: Convert switch_mmu_context() to C")
moved the switch_mmu_context() to C. While in principle a good idea, it
meant that the function now uses the stack. The stack is not accessible
from real mode though.
So to keep calling the function, let's turn on MSR_DR while we call it.
That way, all pointer references to the stack are handled virtually.
In addition, make sure to save/restore r12 on the stack, as it may get
clobbered by the C function.
Fixes: 863771a28e27 ("powerpc/32s: Convert switch_mmu_context() to C")
Cc: stable@vger.kernel.org # v5.14+
Reported-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220510123717.24508-1-graf@amazon.com
|
|
There is currently no dependency for vdso*-wrap.S on vdso*.so, which means that
you can get a build that uses a stale vdso*-wrap.o.
In commit a5b8ca97fbf8, the file that includes the vdso.so was moved and renamed
from arch/arm64/kernel/vdso/vdso.S to arch/arm64/kernel/vdso-wrap.S, when this
happened the Makefile was not updated to force the dependcy on vdso.so.
Fixes: a5b8ca97fbf8 ("arm64: do not descend to vdso directories twice")
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20220510102721.50811-1-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|