summaryrefslogtreecommitdiff
path: root/arch/x86/include/uapi
AgeCommit message (Collapse)Author
2024-05-07KVM: SEV: Allow per-guest configuration of GHCB protocol versionMichael Roth
The GHCB protocol version may be different from one guest to the next. Add a field to track it for each KVM instance and extend KVM_SEV_INIT2 to allow it to be configured by userspace. Now that all SEV-ES support for GHCB protocol version 2 is in place, go ahead and default to it when creating SEV-ES guests through the new KVM_SEV_INIT2 interface. Keep the older KVM_SEV_ES_INIT interface restricted to GHCB protocol version 1. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: SEV: introduce KVM_SEV_INIT2 operationPaolo Bonzini
The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's already a parameter whether SEV or SEV-ES is desired. Another possible source of variability is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct, and put the new op to work by including the VMSA features as a field of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility. The struct also includes the usual bells and whistles for future extensibility: a flags field that must be zero for now, and some padding at the end. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: SEV: define VM types for SEV and SEV-ESPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: SEV: publish supported VMSA featuresPaolo Bonzini
Compute the set of features to be stored in the VMSA when KVM is initialized; move it from there into kvm_sev_info when SEV is initialized, and then into the initial VMSA. The new variable can then be used to return the set of supported features to userspace, via the KVM_GET_DEVICE_ATTR ioctl. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-18KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OPPaolo Bonzini
The data structs for KVM_MEMORY_ENCRYPT_OP have different sizes for 32- and 64-bit userspace, but they do not make any attempt to convert from one ABI to the other when 32-bit userspace is running on 64-bit kernels. This configuration never worked, and SEV is only for 64-bit kernels so we're not breaking ABI on 32-bit kernels. Fix this by adding the appropriate padding; no functional change intended for 64-bit userspace. Reviewed-by: Michael Roth <michael.roth@amd.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-18Merge tag 'kvm-x86-asyncpf_abi-6.9' of https://github.com/kvm-x86/linux into ↵Paolo Bonzini
HEAD Guest-side KVM async #PF ABI cleanup for 6.9 Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes, i.e. beyond a single cache line. The enabled field is purely a guest-side flag that Linux-as-a-guest uses to track whether or not the guest has enabled async #PF support. The actual flag that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the shared kvm_vcpu_pv_apf_data structure. Simply drop the the field and use a dedicated guest-side per-CPU variable to fix the ABI, as opposed to fixing the documentation to match reality. KVM has never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change breaking anything are extremely low.
2024-03-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...
2024-03-12Merge branch 'linus' into x86/boot, to resolve conflictIngo Molnar
There's a new conflict with Linus's upstream tree, because in the following merge conflict resolution in <asm/coco.h>: 38b334fc767e Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Linus has resolved the conflicting placement of 'cc_mask' better than the original commit: 1c811d403afd x86/sev: Fix position dependent variable references in startup code ... which was also done by an internal merge resolution: 2e5fc4786b7a Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree But Linus is right in 38b334fc767e, the 'cc_mask' declaration is sufficient within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block. So instead of forcing Linus to do the same resolution again, merge in Linus's tree and follow his conflict resolution. Conflicts: arch/x86/include/asm/coco.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-11Merge tag 'x86-build-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build updates from Ingo Molnar: - Reduce <asm/bootparam.h> dependencies - Simplify <asm/efi.h> - Unify *_setup_data definitions into <asm/setup_data.h> - Reduce the size of <asm/bootparam.h> * tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Do not include <asm/bootparam.h> in several files x86/efi: Implement arch_ima_efi_boot_mode() in source file x86/setup: Move internal setup_data structures into setup_data.h x86/setup: Move UAPI setup structures into setup_data.h
2024-03-04x86/boot: Move mem_encrypt= parsing to the decompressorArd Biesheuvel
The early SME/SEV code parses the command line very early, in order to decide whether or not memory encryption should be enabled, which needs to occur even before the initial page tables are created. This is problematic for a number of reasons: - this early code runs from the 1:1 mapping provided by the decompressor or firmware, which uses a different translation than the one assumed by the linker, and so the code needs to be built in a special way; - parsing external input while the entire kernel image is still mapped writable is a bad idea in general, and really does not belong in security minded code; - the current code ignores the built-in command line entirely (although this appears to be the case for the entire decompressor) Given that the decompressor/EFI stub is an intrinsic part of the x86 bootable kernel image, move the command line parsing there and out of the core kernel. This removes the need to build lib/cmdline.o in a special way, or to use RIP-relative LEA instructions in inline asm blocks. This involves a new xloadflag in the setup header to indicate that mem_encrypt=on appeared on the kernel command line. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-17-ardb+git@google.com
2024-02-22KVM: x86/xen: allow vcpu_info to be mapped by fixed HVAPaul Durrant
If the guest does not explicitly set the GPA of vcpu_info structure in memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded in the shared_info page may be used. As described in a previous commit, the shared_info page is an overlay at a fixed HVA within the VMM, so in this case it also more optimal to activate the vcpu_info cache with a fixed HVA to avoid unnecessary invalidation if the guest memory layout is modified. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org [sean: use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: x86/xen: allow shared_info to be mapped by fixed HVAPaul Durrant
The shared_info page is not guest memory as such. It is a dedicated page allocated by the VMM and overlaid onto guest memory in a GFN chosen by the guest and specified in the XENMEM_add_to_physmap hypercall. The guest may even request that shared_info be moved from one GFN to another by re-issuing that hypercall, but the HVA is never going to change. Because the shared_info page is an overlay the memory slots need to be updated in response to the hypercall. However, memory slot adjustment is not atomic and, whilst all vCPUs are paused, there is still the possibility that events may be delivered (which requires the shared_info page to be updated) whilst the shared_info GPA is absent. The HVA is never absent though, so it makes much more sense to use that as the basis for the kernel's mapping. Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its availability. Don't actually advertize it yet though. That will be done in a subsequent patch, which will also add tests for the new attribute type. Also update the KVM API documentation with the new attribute and also fix it up to consistently refer to 'shared_info' (with the underscore). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org [sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-08KVM: define __KVM_HAVE_GUEST_DEBUG unconditionallyPaolo Bonzini
Since all architectures (for historical reasons) have to define struct kvm_guest_debug_arch, and since userspace has to check KVM_CHECK_EXTENSION(KVM_CAP_SET_GUEST_DEBUG) anyway, there is no advantage in masking the capability #define itself. Remove the #define __KVM_HAVE_GUEST_DEBUG from architecture-specific headers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbolPaolo Bonzini
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask unused definitions in include/uapi/linux/kvm.h. __KVM_HAVE_READONLY_MEM however was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly return nonzero. This however does not make sense, and it prevented userspace from supporting this architecture-independent feature without recompilation. Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and is only used in virt/kvm/kvm_main.c. Userspace does not need to test it and there should be no need for it to exist. Remove it and replace it with a Kconfig symbol within Linux source code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08KVM: x86: move x86-specific structs to uapi/asm/kvm.hPaolo Bonzini
Several capabilities that exist only on x86 nevertheless have their structs defined in include/uapi/linux/kvm.h. Move them to arch/x86/include/uapi/asm/kvm.h for cleanliness. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08kvm: x86: use a uapi-friendly macro for GENMASKPaolo Bonzini
Change uapi header uses of GENMASK to instead use the uapi/linux/bits.h bit macros, since GENMASK is not defined in uapi headers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08kvm: x86: use a uapi-friendly macro for BITDionna Glaze
Change uapi header uses of BIT to instead use the uapi/linux/const.h bit macros, since BIT is not defined in uapi headers. The PMU mask uses _BITUL since it targets a 32 bit flag field, whereas the longmode definition is meant for a 64 bit flag field. Cc: Sean Christophersen <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Message-Id: <20231207001142.3617856-1-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-06x86/kvm: Use separate percpu variable to track the enabling of asyncpfXiaoyao Li
Refer to commit fd10cde9294f ("KVM paravirt: Add async PF initialization to PV guest") and commit 344d9588a9df ("KVM: Add PV MSR to enable asynchronous page faults delivery"). It turns out that at the time when asyncpf was introduced, the purpose was defining the shared PV data 'struct kvm_vcpu_pv_apf_data' with the size of 64 bytes. However, it made a mistake and defined the size to 68 bytes, which failed to make fit in a cache line and made the code inconsistent with the documentation. Below justification quoted from Sean[*] KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and the documentation clearly states that enabling is based solely on the bit in the synthetic MSR. So rather than update the documentation, fix the goof by removing the enabled filed and use the separate percpu variable instread. KVM-as-a-host obviously doesn't enforce anything or consume the size, and changing the header will only affect guests that are rebuilt against the new header, so there's no chance of ABI breakage between KVM and its guests. The only possible breakage is if some other hypervisor is emulating KVM's async #PF (LOL) and relies on the guest to set kvm_vcpu_pv_apf_data.enabled. But (a) I highly doubt such a hypervisor exists, (b) that would arguably be a violation of KVM's "spec", and (c) the worst case scenario is that the guest would simply lose async #PF functionality. [*] https://lore.kernel.org/all/ZS7ERnnRqs8Fl0ZF@google.com/T/#u Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20231025055914.1201792-2-xiaoyao.li@intel.com [sean: use true/false instead of 1/0 for booleans] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31x86/cpu: Add X86_CR4_FRED macroH. Peter Anvin (Intel)
Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be changed after initialization, so add it to the pinned CR4 bits. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-12-xin3.li@intel.com
2024-01-30x86/setup: Move UAPI setup structures into setup_data.hThomas Zimmermann
The type definition of struct pci_setup_rom in <asm/pci.h> requires struct setup_data from <asm/bootparam.h>. Many drivers include <linux/pci.h>, but do not use boot parameters. Changes to bootparam.h or its included header files could easily trigger a large, unnecessary rebuild of the kernel. Moving struct setup_data and related code into its own header file avoids including <asm/bootparam.h> in <asm/pci.h>. Instead include the new header <asm/screen_data.h> and remove the include statement for x86_init.h, which is unnecessary but pulls in bootparams.h. Suggested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240112095000.8952-2-tzimmermann@suse.de
2024-01-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-10Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull header cleanups from Kent Overstreet: "The goal is to get sched.h down to a type only header, so the main thing happening in this patchset is splitting out various _types.h headers and dependency fixups, as well as moving some things out of sched.h to better locations. This is prep work for the memory allocation profiling patchset which adds new sched.h interdepencencies" * tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits) Kill sched.h dependency on rcupdate.h kill unnecessary thread_info.h include Kill unnecessary kernel.h include preempt.h: Kill dependency on list.h rseq: Split out rseq.h from sched.h LoongArch: signal.c: add header file to fix build error restart_block: Trim includes lockdep: move held_lock to lockdep_types.h sem: Split out sem_types.h uidgid: Split out uidgid_types.h seccomp: Split out seccomp_types.h refcount: Split out refcount_types.h uapi/linux/resource.h: fix include x86/signal: kill dependency on time.h syscall_user_dispatch.h: split out *_types.h mm_types_task.h: Trim dependencies Split out irqflags_types.h ipc: Kill bogus dependency on spinlock.h shm: Slim down dependencies workqueue: Split out workqueue_types.h ...
2024-01-03arch/x86: Fix typosBjorn Helgaas
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2023-12-20x86/signal: kill dependency on time.hKent Overstreet
this is unecessary, and was pulling in printk.h from uapi headers Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-14Merge branch 'kvm-guestmemfd' into HEADPaolo Bonzini
Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgly or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between memfd files (and every other memory subystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14KVM: x86: Add support for "protected VMs" that can utilize private memorySean Christopherson
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, and potentially to even become a "real" product in the distant future, e.g. a la pKVM. The private memory support in KVM x86 is aimed at AMD's SEV-SNP and Intel's TDX, but those technologies are extremely complex (understatement), difficult to debug, don't support running as nested guests, and require hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX for maintaining guest private memory isn't a realistic option. At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of selftests for guest_memfd and private memory support without requiring unique hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20231027182217.3615211-24-seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12platform/x86/amd/hsmp: add support for metrics tblSuma Hegde
AMD MI300 MCM provides GET_METRICS_TABLE message to retrieve all the system management information from SMU. The metrics table is made available as hexadecimal sysfs binary file under per socket sysfs directory created at /sys/devices/platform/amd_hsmp/socket%d/metrics_bin Metrics table definitions will be documented as part of Public PPR. The same is defined in the amd_hsmp.h header. Signed-off-by: Suma Hegde <suma.hegde@amd.com> Reviewed-by: Naveen Krishna Chatradhi <nchatrad@amd.com> Link: https://lore.kernel.org/r/20231010120310.3464066-2-suma.hegde@amd.com [ij: lseek -> lseek(), dram -> DRAM in dev_err()] [ij: added period to terminate a documentation sentence] Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2023-09-06x86: Remove the arch_calc_vm_prot_bits() macro from the UAPIThomas Huth
The arch_calc_vm_prot_bits() macro uses VM_PKEY_BIT0 etc. which are not part of the UAPI, so the macro is completely useless for userspace. It is also hidden behind the CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS config switch which we shouldn't expose to userspace. Thus let's move this macro into a new internal header instead. Fixes: 8f62c883222c ("x86/mm/pkeys: Add arch-specific VMA protection bits") Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/r/20230906162658.142511-1-thuth@redhat.com
2023-08-02x86/shstk: Add ARCH_SHSTK_STATUSRick Edgecombe
CRIU and GDB need to get the current shadow stack and WRSS enablement status. This information is already available via /proc/pid/status, but this is inconvenient for CRIU because it involves parsing the text output in an area of the code where this is difficult. Provide a status arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a userspace address, and make the new arch_prctl simply copy the features out to userspace. Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-43-rick.p.edgecombe%40intel.com
2023-08-02x86/shstk: Add ARCH_SHSTK_UNLOCKMike Rapoport
Userspace loaders may lock features before a CRIU restore operation has the chance to set them to whatever state is required by the process being restored. Allow a way for CRIU to unlock features. Add it as an arch_prctl() like the other shadow stack operations, but restrict it being called by the ptrace arch_pctl() interface. [Merged into recent API changes, added commit log and docs] Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-42-rick.p.edgecombe%40intel.com
2023-08-02x86/shstk: Support WRSS for userspaceRick Edgecombe
For the current shadow stack implementation, shadow stacks contents can't easily be provisioned with arbitrary data. This property helps apps protect themselves better, but also restricts any potential apps that may want to do exotic things at the expense of a little security. The x86 shadow stack feature introduces a new instruction, WRSS, which can be enabled to write directly to shadow stack memory from userspace. Allow it to get enabled via the prctl interface. Only enable the userspace WRSS instruction, which allows writes to userspace shadow stacks from userspace. Do not allow it to be enabled independently of shadow stack, as HW does not support using WRSS when shadow stack is disabled. >From a fault handler perspective, WRSS will behave very similar to WRUSS, which is treated like a user access from a #PF err code perspective. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-36-rick.p.edgecombe%40intel.com
2023-08-02x86/shstk: Introduce map_shadow_stack syscallRick Edgecombe
When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads, however in some cases userspace will need additional shadow stacks. The main example of this is the ucontext family of functions, which require userspace allocating and pivoting to userspace managed stacks. Unlike most other user memory permissions, shadow stacks need to be provisioned with special data in order to be useful. They need to be setup with a restore token so that userspace can pivot to them via the RSTORSSP instruction. But, the security design of shadow stacks is that they should not be written to except in limited circumstances. This presents a problem for userspace, as to how userspace can provision this special data, without allowing for the shadow stack to be generally writable. Previously, a new PROT_SHADOW_STACK was attempted, which could be mprotect()ed from RW permissions after the data was provisioned. This was found to not be secure enough, as other threads could write to the shadow stack during the writable window. The kernel can use a special instruction, WRUSS, to write directly to userspace shadow stacks. So the solution can be that memory can be mapped as shadow stack permissions from the beginning (never generally writable in userspace), and the kernel itself can write the restore token. First, a new madvise() flag was explored, which could operate on the PROT_SHADOW_STACK memory. This had a couple of downsides: 1. Extra checks were needed in mprotect() to prevent writable memory from ever becoming PROT_SHADOW_STACK. 2. Extra checks/vma state were needed in the new madvise() to prevent restore tokens being written into the middle of pre-used shadow stacks. It is ideal to prevent restore tokens being added at arbitrary locations, so the check was to make sure the shadow stack had never been written to. 3. It stood out from the rest of the madvise flags, as more of direct action than a hint at future desired behavior. So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and setup new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to setup its own shadow stacks using the WRSS instruction. Towards this provide a flag so that stacks can be optionally setup securely for the common case of ucontext without enabling WRSS. Or potentially have the kernel set up the shadow stack in some new way. The following example demonstrates how to create a new shadow stack with map_shadow_stack: void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
2023-08-02x86/shstk: Add user-mode shadow stack supportRick Edgecombe
Introduce basic shadow stack enabling/disabling/allocation routines. A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag and has a fixed size of min(RLIMIT_STACK, 4GB). Keep the task's shadow stack address and size in thread_struct. This will be copied when cloning new threads, but needs to be cleared during exec, so add a function to do this. 32 bit shadow stack is not expected to have many users and it will complicate the signal implementation. So do not support IA32 emulation or x32. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-29-rick.p.edgecombe%40intel.com
2023-08-02x86: Introduce userspace API for shadow stackRick Edgecombe
Add three new arch_prctl() handles: - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified feature. Returns 0 on success or a negative value on error. - ARCH_SHSTK_LOCK prevents future disabling or enabling of the specified feature. Returns 0 on success or a negative value on error. The features are handled per-thread and inherited over fork(2)/clone(2), but reset on exec(). Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-27-rick.p.edgecombe%40intel.com
2023-07-11x86/mm: Introduce MAP_ABOVE4GRick Edgecombe
The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which require some core mm changes to function properly. One of the properties is that the shadow stack pointer (SSP), which is a CPU register that points to the shadow stack like the stack pointer points to the stack, can't be pointing outside of the 32 bit address space when the CPU is executing in 32 bit mode. It is desirable to prevent executing in 32 bit mode when shadow stack is enabled because the kernel can't easily support 32 bit signals. On x86 it is possible to transition to 32 bit mode without any special interaction with the kernel, by doing a "far call" to a 32 bit segment. So the shadow stack implementation can use this address space behavior as a feature, by enforcing that shadow stack memory is always mapped outside of the 32 bit address space. This way userspace will trigger a general protection fault which will in turn trigger a segfault if it tries to transition to 32 bit mode with shadow stack enabled. This provides a clean error generating border for the user if they try attempt to do 32 bit mode shadow stack, rather than leave the kernel in a half working state for userspace to be surprised by. So to allow future shadow stack enabling patches to map shadow stacks out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior is pretty much like MAP_32BIT, except that it has the opposite address range. The are a few differences though. If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit syscall. Since the default search behavior is top down, the normal kaslr base can be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its own randomization in the bottom up case. For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G both are potentially valid, so both are used. In the bottomup search path, the default behavior is already consistent with MAP_ABOVE4G since mmap base should be above 4GB. Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB. So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode with shadow stack enabled would usually segfault anyway. This is already pretty decent guard rails. But the addition of MAP_ABOVE4G is some small complexity spent to make it make it more complete. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-21-rick.p.edgecombe%40intel.com
2023-06-01x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALIDJuergen Gross
mtrr_type_lookup() should always return a valid memory type. In case there is no information available, it should return the default UC. This will remove the last case where mtrr_type_lookup() can return MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h. Note that removing the MTRR_TYPE_INVALID #define from that header could break user code, so it has to stay. At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR case should set uniform to 1, as if the memory range would be covered by no MTRR at all. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-15-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-06-01x86/mtrr: Replace size_or_mask and size_and_mask with a much easier conceptJuergen Gross
Replace size_or_mask and size_and_mask with the much easier concept of high reserved bits. While at it, instead of using constants in the MTRR code, use some new [ bp: - Drop mtrr_set_mask() - Unbreak long lines - Move struct mtrr_state_type out of the uapi header as it doesn't belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’" as Reported-by: kernel test robot <lkp@intel.com> - Massage. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ...
2023-04-28Merge tag 'x86_mm_for_6.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()
2023-04-05KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALLOliver Upton
The 'longmode' field is a bit annoying as it blows an entire __u32 to represent a boolean value. Since other architectures are looking to add support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it up. Redefine the field (and the remaining padding) as a set of flags. Preserve the existing ABI by using bit 0 to indicate if the guest was in long mode and requiring that the remaining 31 bits must be zero. Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
2023-03-22x86/arch_prctl: Add AMX feature numbers as ABI constantsChang S. Bae
Each distinct XSAVE feature has a number assigned to it. Among other things, the number determines the ordering of features in the XSAVE buffer and is also used to generate XSAVE bitmasks like the value for XCR0. AMX state is dynamically enabled by the architecture-specific prctl(). This prctl() takes one XSAVE feature number as an argument. However, the feature numbers are not defined in any readily available userspace headers. The means that each userspace app trying to use dynamic feature prctl()s will likely end up defining their own constants for each feature. Since these feature numbers are a part of the uabi, expose them in the prctl() uabi header. Save everyone the trouble of looking them up and defining their own. [ dhansen: expand changelog a bit ] Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230121001900.14900-3-chang.seok.bae%40intel.com
2023-03-16x86/mm/iommu/sva: Make LAM and SVA mutually exclusiveKirill A. Shutemov
IOMMU and SVA-capable devices know nothing about LAM and only expect canonical addresses. An attempt to pass down tagged pointer will lead to address translation failure. By default do not allow to enable both LAM and use SVA in the same process. The new ARCH_FORCE_TAGGED_SVA arch_prctl() overrides the limitation. By using the arch_prctl() userspace takes responsibility to never pass tagged address to the device. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230312112612.31869-12-kirill.shutemov%40linux.intel.com
2023-03-16x86/mm: Provide arch_prctl() interface for LAMKirill A. Shutemov
Add a few of arch_prctl() handles: - ARCH_ENABLE_TAGGED_ADDR enabled LAM. The argument is required number of tag bits. It is rounded up to the nearest LAM mode that can provide it. For now only LAM_U57 is supported, with 6 tag bits. - ARCH_GET_UNTAG_MASK returns untag mask. It can indicates where tag bits located in the address. - ARCH_GET_MAX_TAG_BITS returns the maximum tag bits user can request. Zero if LAM is not supported. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-9-kirill.shutemov%40linux.intel.com
2023-03-16x86: CPUID and CR3/CR4 flags for Linear Address MaskingKirill A. Shutemov
Enumerate Linear Address Masking and provide defines for CR3 and CR4 flags. The new CONFIG_ADDRESS_MASKING option enables the feature support in kernel. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-4-kirill.shutemov%40linux.intel.com
2023-02-15Merge tag 'kvm-riscv-6.3-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv changes for 6.3 - Fix wrong usage of PGDIR_SIZE to check page sizes - Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect() - Redirect illegal instruction traps to guest - SBI PMU support for guest
2023-02-15Merge tag 'kvm-x86-pmu-6.3' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 PMU changes for 6.3: - Add support for created masked events for the PMU filter to allow userspace to heavily restrict what events the guest can use without needing to create an absurd number of events - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU support is disabled - Add PEBS support for Intel SPR
2023-01-24KVM: x86/pmu: Introduce masked events to the pmu event filterAaron Lewis
When building a list of filter events, it can sometimes be a challenge to fit all the events needed to adequately restrict the guest into the limited space available in the pmu event filter. This stems from the fact that the pmu event filter requires each event (i.e. event select + unit mask) be listed, when the intention might be to restrict the event select all together, regardless of it's unit mask. Instead of increasing the number of filter events in the pmu event filter, add a new encoding that is able to do a more generalized match on the unit mask. Introduce masked events as another encoding the pmu event filter understands. Masked events has the fields: mask, match, and exclude. When filtering based on these events, the mask is applied to the guest's unit mask to see if it matches the match value (i.e. umask & mask == match). The exclude bit can then be used to exclude events from that match. E.g. for a given event select, if it's easier to say which unit mask values shouldn't be filtered, a masked event can be set up to match all possible unit mask values, then another masked event can be set up to match the unit mask values that shouldn't be filtered. Userspace can query to see if this feature exists by looking for the capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS. This feature is enabled by setting the flags field in the pmu event filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS. Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY(). It is an error to have a bit set outside the valid bits for a masked event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in such cases, including the high bits of the event select (35:32) if called on Intel. With these updates the filter matching code has been updated to match on a common event. Masked events were flexible enough to handle both event types, so they were used as the common event. This changes how guest events get filtered because regardless of the type of event used in the uAPI, they will be converted to masked events. Because of this there could be a slight performance hit because instead of matching the filter event with a lookup on event select + unit mask, it does a lookup on event select then walks the unit masks to find the match. This shouldn't be a big problem because I would expect the set of common event selects to be small, and if they aren't the set can likely be reduced by using masked events to generalize the unit mask. Using one type of event when filtering guest events allows for a common code path to be used. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: Replace 0-length arrays with flexible arraysKees Cook
Zero-length arrays are deprecated[1]. Replace struct kvm_nested_state's "data" union 0-length arrays with flexible arrays. (How are the sizes of these arrays verified?) Detected with GCC 13, using -fstrict-flex-arrays=3: arch/x86/kvm/svm/nested.c: In function 'svm_get_nested_state': arch/x86/kvm/svm/nested.c:1536:17: error: array subscript 0 is outside array bounds of 'struct kvm_svm_nested_state_data[0]' [-Werror=array-bounds=] 1536 | &user_kvm_nested_state->data.svm[0]; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from include/uapi/linux/kvm.h:15, from include/linux/kvm_host.h:40, from arch/x86/kvm/svm/nested.c:18: arch/x86/include/uapi/asm/kvm.h:511:50: note: while referencing 'svm' 511 | struct kvm_svm_nested_state_data svm[0]; | ^~~ [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230105190548.never.323-kees@kernel.org Link: https://lore.kernel.org/r/20230118195905.gonna.693-kees@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-19x86/sev: Add SEV-SNP guest feature negotiation supportNikunj A Dadhania
The hypervisor can enable various new features (SEV_FEATURES[1:63]) and start a SNP guest. Some of these features need guest side implementation. If any of these features are enabled without it, the behavior of the SNP guest will be undefined. It may fail booting in a non-obvious way making it difficult to debug. Instead of allowing the guest to continue and have it fail randomly later, detect this early and fail gracefully. The SEV_STATUS MSR indicates features which the hypervisor has enabled. While booting, SNP guests should ascertain that all the enabled features have guest side implementation. In case a feature is not implemented in the guest, the guest terminates booting with GHCB protocol Non-Automatic Exit(NAE) termination request event, see "SEV-ES Guest-Hypervisor Communication Block Standardization" document (currently at https://developer.amd.com/wp-content/resources/56421.pdf), section "Termination Request". Populate SW_EXITINFO2 with mask of unsupported features that the hypervisor can easily report to the user. More details in the AMD64 APM Vol 2, Section "SEV_STATUS MSR". [ bp: - Massage. - Move snp_check_features() call to C code. Note: the CC:stable@ aspect here is to be able to protect older, stable kernels when running on newer hypervisors. Or not "running" but fail reliably and in a well-defined manner instead of randomly. ] Fixes: cbd3d4f7c4e5 ("x86/sev: Check SEV-SNP features support") Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230118061943.534309-1-nikunj@amd.com
2022-12-02KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctlJavier Martinez Canillas
The documentation says that the ioctl has been deprecated, but it has been actually removed and the remaining references are just left overs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Message-Id: <20221202105011.185147-3-javierm@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>