summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2024-03-10Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not writable from userspace, so there would be no way to write to a read-only guest_memfd). - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely for development and testing. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD dirty logging test that caused false passes. x86 fixes: - Fix missing marking of a guest page as dirty when emulating an atomic access. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention with preemptible kernels (including CONFIG_PREEMPT_DYNAMIC in non-preemptible mode). - Disable AMD DebugSwap by default, it breaks VMSA signing and will be re-enabled with a better VM creation API in 6.10. - Do the cache flush of converted pages in svm_register_enc_region() before dropping kvm->lock, to avoid a race with unregistering of the same region and the consequent use-after-free issue" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: SEV: disable SEV-ES DebugSwap by default KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region() KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY KVM: x86: Mark target gfn of emulated atomic instruction as dirty
2024-03-09SEV: disable SEV-ES DebugSwap by defaultPaolo Bonzini
The DebugSwap feature of SEV-ES provides a way for confidential guests to use data breakpoints. However, because the status of the DebugSwap feature is recorded in the VMSA, enabling it by default invalidates the attestation signatures. In 6.10 we will introduce a new API to create SEV VMs that will allow enabling DebugSwap based on what the user tells KVM to do. Contextually, we will change the legacy KVM_SEV_ES_INIT API to never enable DebugSwap. For compatibility with kernels that pre-date the introduction of DebugSwap, as well as with those where KVM_SEV_ES_INIT will never enable it, do not enable the feature by default. If anybody wants to use it, for now they can enable the sev_es_debug_swap_enabled module parameter, but this will result in a warning. Fixes: d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-09Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of ↵Paolo Bonzini
https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09Merge tag 'kvm-x86-fixes-6.8-2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fixes for 6.8, round 2: - When emulating an atomic access, mark the gfn as dirty in the memslot to fix a bug where KVM could fail to mark the slot as dirty during live migration, ultimately resulting in guest data corruption due to a dirty page not being re-copied from the source to the target. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention. Contending mmu_lock is especially problematic on preemptible kernels, as KVM may yield mmu_lock in response to the contention, which severely degrades overall performance due to vCPUs making it difficult for the task that triggered invalidation to make forward progress. Note, due to another kernel bug, this fix isn't limited to preemtible kernels, as any kernel built with CONFIG_PREEMPT_DYNAMIC=y will yield contended rwlocks and spinlocks. https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com
2024-03-05Merge tag 'hyperv-fixes-signed-20240303' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Multiple fixes, cleanups and documentations for Hyper-V core code and drivers * tag 'hyperv-fixes-signed-20240303' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Drivers: hv: vmbus: make hv_bus const x86/hyperv: Allow 15-bit APIC IDs for VTL platforms x86/hyperv: Make encrypted/decrypted changes safe for load_unaligned_zeropad() x86/mm: Regularize set_memory_p() parameters and make non-static x86/hyperv: Use slow_virt_to_phys() in page transition hypervisor callback Documentation: hyperv: Add overview of PCI pass-thru device support Drivers: hv: vmbus: Update indentation in create_gpadl_header() Drivers: hv: vmbus: Remove duplication and cleanup code in create_gpadl_header() fbdev/hyperv_fb: Fix logic error for Gen2 VMs in hvfb_getmem() Drivers: hv: vmbus: Calculate ring buffer size for more efficient use of memory hv_utils: Allow implicit ICTIMESYNCFLAG_SYNC
2024-03-03Merge tag 'x86_urgent_for_v6.8_rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Do not reserve SETUP_RNG_SEED setup data in the e820 map as it should be used by kexec only - Make sure MKTME feature detection happens at an earlier time in the boot process so that the physical address size supported by the CPU is properly corrected and MTRR masks are programmed properly, leading to TDX systems booting without disable_mtrr_cleanup on the cmdline - Make sure the different address sizes supported by the CPU are read out as early as possible * tag 'x86_urgent_for_v6.8_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/e820: Don't reserve SETUP_RNG_SEED in e820 x86/cpu/intel: Detect TME keyid bits before setting MTRR mask registers x86/cpu: Allow reducing x86_phys_bits during early_identify_cpu()
2024-03-01x86/e820: Don't reserve SETUP_RNG_SEED in e820Jiri Bohac
SETUP_RNG_SEED in setup_data is supplied by kexec and should not be reserved in the e820 map. Doing so reserves 16 bytes of RAM when booting with kexec. (16 bytes because data->len is zeroed by parse_setup_data so only sizeof(setup_data) is reserved.) When kexec is used repeatedly, each boot adds two entries in the kexec-provided e820 map as the 16-byte range splits a larger range of usable memory. Eventually all of the 128 available entries get used up. The next split will result in losing usable memory as the new entries cannot be added to the e820 map. Fixes: 68b8e9713c8e ("x86/setup: Use rng seeds from setup_data") Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/ZbmOjKnARGiaYBd5@dwarf.suse.cz
2024-03-01x86/hyperv: Allow 15-bit APIC IDs for VTL platformsSaurabh Sengar
The current method for signaling the compatibility of a Hyper-V host with MSIs featuring 15-bit APIC IDs relies on a synthetic cpuid leaf. However, for higher VTLs, this leaf is not reported, due to the absence of an IO-APIC. As an alternative, assume that when running at a high VTL, the host supports 15-bit APIC IDs. This assumption is safe, as Hyper-V does not employ any architectural MSIs at higher VTLs This unblocks startup of VTL2 environments with more than 256 CPUs. Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1705341460-18394-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1705341460-18394-1-git-send-email-ssengar@linux.microsoft.com>
2024-03-01x86/hyperv: Make encrypted/decrypted changes safe for load_unaligned_zeropad()Michael Kelley
In a CoCo VM, when transitioning memory from encrypted to decrypted, or vice versa, the caller of set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring the memory isn't in use and isn't referenced while the transition is in progress. The transition has multiple steps, and the memory is in an inconsistent state until all steps are complete. A reference while the state is inconsistent could result in an exception that can't be cleanly fixed up. However, the kernel load_unaligned_zeropad() mechanism could cause a stray reference that can't be prevented by the caller of set_memory_encrypted() or set_memory_decrypted(), so there's specific code to handle this case. But a CoCo VM running on Hyper-V may be configured to run with a paravisor, with the #VC or #VE exception routed to the paravisor. There's no architectural way to forward the exceptions back to the guest kernel, and in such a case, the load_unaligned_zeropad() specific code doesn't work. To avoid this problem, mark pages as "not present" while a transition is in progress. If load_unaligned_zeropad() causes a stray reference, a normal page fault is generated instead of #VC or #VE, and the page-fault-based fixup handlers for load_unaligned_zeropad() resolve the reference. When the encrypted/decrypted transition is complete, mark the pages as "present" again. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/r/20240116022008.1023398-4-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240116022008.1023398-4-mhklinux@outlook.com>
2024-03-01x86/mm: Regularize set_memory_p() parameters and make non-staticMichael Kelley
set_memory_p() is currently static. It has parameters that don't match set_memory_p() under arch/powerpc and that aren't congruent with the other set_memory_* functions. There's no good reason for the difference. Fix this by making the parameters consistent, and update the one existing call site. Make the function non-static and add it to include/asm/set_memory.h so that it is completely parallel to set_memory_np() and is usable in other modules. No functional change. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20240116022008.1023398-3-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240116022008.1023398-3-mhklinux@outlook.com>
2024-03-01x86/hyperv: Use slow_virt_to_phys() in page transition hypervisor callbackMichael Kelley
In preparation for temporarily marking pages not present during a transition between encrypted and decrypted, use slow_virt_to_phys() in the hypervisor callback. As long as the PFN is correct, slow_virt_to_phys() works even if the leaf PTE is not present. The existing functions that depend on vmalloc_to_page() all require that the leaf PTE be marked present, so they don't work. Update the comments for slow_virt_to_phys() to note this broader usage and the requirement to work even if the PTE is not marked present. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20240116022008.1023398-2-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240116022008.1023398-2-mhklinux@outlook.com>
2024-02-26x86/cpu/intel: Detect TME keyid bits before setting MTRR mask registersPaolo Bonzini
MKTME repurposes the high bit of physical address to key id for encryption key and, even though MAXPHYADDR in CPUID[0x80000008] remains the same, the valid bits in the MTRR mask register are based on the reduced number of physical address bits. detect_tme() in arch/x86/kernel/cpu/intel.c detects TME and subtracts it from the total usable physical bits, but it is called too late. Move the call to early_init_intel() so that it is called in setup_arch(), before MTRRs are setup. This fixes boot on TDX-enabled systems, which until now only worked with "disable_mtrr_cleanup". Without the patch, the values written to the MTRRs mask registers were 52-bit wide (e.g. 0x000fffff_80000800) and the writes failed; with the patch, the values are 46-bit wide, which matches the reduced MAXPHYADDR that is shown in /proc/cpuinfo. Reported-by: Zixi Chen <zixchen@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20240131230902.1867092-3-pbonzini%40redhat.com
2024-02-26x86/cpu: Allow reducing x86_phys_bits during early_identify_cpu()Paolo Bonzini
In commit fbf6449f84bf ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach"), the initialization of c->x86_phys_bits was moved after this_cpu->c_early_init(c). This is incorrect because early_init_amd() expected to be able to reduce the value according to the contents of CPUID leaf 0x8000001f. Fortunately, the bug was negated by init_amd()'s call to early_init_amd(), which does reduce x86_phys_bits in the end. However, this is very late in the boot process and, most notably, the wrong value is used for x86_phys_bits when setting up MTRRs. To fix this, call get_cpu_address_sizes() as soon as X86_FEATURE_CPUID is set/cleared, and c->extended_cpuid_level is retrieved. Fixes: fbf6449f84bf ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20240131230902.1867092-2-pbonzini%40redhat.com
2024-02-25Merge tag 'x86_urgent_for_v6.8_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure clearing CPU buffers using VERW happens at the latest possible point in the return-to-userspace path, otherwise memory accesses after the VERW execution could cause data to land in CPU buffers again * tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/VMX: Move VERW closer to VMentry for MDS mitigation KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key x86/entry_32: Add VERW just before userspace transition x86/entry_64: Add VERW just before userspace transition x86/bugs: Add asm helpers for executing VERW
2024-02-24Merge tag 'cxl-fixes-6.8-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fixes from Dan Williams: "A collection of significant fixes for the CXL subsystem. The largest change in this set, that bordered on "new development", is the fix for the fact that the location of the new qos_class attribute did not match the Documentation. The fix ends up deleting more code than it added, and it has a new unit test to backstop basic errors in this interface going forward. So the "red-diff" and unit test saved the "rip it out and try again" response. In contrast, the new notification path for firmware reported CXL errors (CXL CPER notifications) has a locking context bug that can not be fixed with a red-diff. Given where the release cycle stands, it is not comfortable to squeeze in that fix in these waning days. So, that receives the "back it out and try again later" treatment. There is a regression fix in the code that establishes memory NUMA nodes for platform CXL regions. That has an ack from x86 folks. There are a couple more fixups for Linux to understand (reassemble) CXL regions instantiated by platform firmware. The policy around platforms that do not match host-physical-address with system-physical-address (i.e. systems that have an address translation mechanism between the address range reported in the ACPI CEDT.CFMWS and endpoint decoders) has been softened to abort driver load rather than teardown the memory range (can cause system hangs). Lastly, there is a robustness / regression fix for cases where the driver would previously continue in the face of error, and a fixup for PCI error notification handling. Summary: - Fix NUMA initialization from ACPI CEDT.CFMWS - Fix region assembly failures due to async init order - Fix / simplify export of qos_class information - Fix cxl_acpi initialization vs single-window-init failures - Fix handling of repeated 'pci_channel_io_frozen' notifications - Workaround platforms that violate host-physical-address == system-physical address assumptions - Defer CXL CPER notification handling to v6.9" * tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/acpi: Fix load failures due to single window creation failure acpi/ghes: Remove CXL CPER notifications cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window cxl/test: Add support for qos_class checking cxl: Fix sysfs export of qos_class for memdev cxl: Remove unnecessary type cast in cxl_qos_class_verify() cxl: Change 'struct cxl_memdev_state' *_perf_list to single 'struct cxl_dpa_perf' cxl/region: Allow out of order assembly of autodiscovered regions cxl/region: Handle endpoint decoders in cxl_region_find_decoder() x86/numa: Fix the sort compare func used in numa_fill_memblks() x86/numa: Fix the address overlap check in numa_fill_memblks() cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
2024-02-23KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changingSean Christopherson
Retry page faults without acquiring mmu_lock, and without even faulting the page into the primary MMU, if the resolved gfn is covered by an active invalidation. Contending for mmu_lock is especially problematic on preemptible kernels as the mmu_notifier invalidation task will yield mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and ultimately increase the latency of resolving the page fault. And in the worst case scenario, yielding will be accompanied by a remote TLB flush, e.g. if the invalidation covers a large range of memory and vCPUs are accessing addresses that were already zapped. Faulting the page into the primary MMU is similarly problematic, as doing so may acquire locks that need to be taken for the invalidation to complete (the primary MMU has finer grained locks than KVM's MMU), and/or may cause unnecessary churn (getting/putting pages, marking them accessed, etc). Alternatively, the yielding issue could be mitigated by teaching KVM's MMU iterators to perform more work before yielding, but that wouldn't solve the lock contention and would negatively affect scenarios where a vCPU is trying to fault in an address that is NOT covered by the in-progress invalidation. Add a dedicated lockess version of the range-based retry check to avoid false positives on the sanity check on start+end WARN, and so that it's super obvious that checking for a racing invalidation without holding mmu_lock is unsafe (though obviously useful). Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking invalidation in a loop won't put KVM into an infinite loop, e.g. due to caching the in-progress flag and never seeing it go to '0'. Force a load of mmu_invalidate_seq as well, even though it isn't strictly necessary to avoid an infinite loop, as doing so improves the probability that KVM will detect an invalidation that already completed before acquiring mmu_lock and bailing anyways. Do the pre-check even for non-preemptible kernels, as waiting to detect the invalidation until mmu_lock is held guarantees the vCPU will observe the worst case latency in terms of handling the fault, and can generate even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, detect retry, drop mmu_lock, re-enter the guest, retake the fault, and eventually re-acquire mmu_lock. This behavior is also why there are no new starvation issues due to losing the fairness guarantees provided by rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting on mmu_lock doesn't guarantee forward progress in the face of _another_ mmu_notifier invalidation event. Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() may generate a load into a register instead of doing a direct comparison (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost is a few bytes of code and maaaaybe a cycle or three. Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com Reported-by: Friedrich Weber <f.weber@proxmox.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Yuan Yao <yuan.yao@linux.intel.com> Cc: Xu Yilun <yilun.xu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()Sean Christopherson
Do the cache flush of converted pages in svm_register_enc_region() before dropping kvm->lock to fix use-after-free issues where region and/or its array of pages could be freed by a different task, e.g. if userspace has __unregister_enc_region_locked() already queued up for the region. Note, the "obvious" alternative of using local variables doesn't fully resolve the bug, as region->pages is also dynamically allocated. I.e. the region structure itself would be fine, but region->pages could be freed. Flushing multiple pages under kvm->lock is unfortunate, but the entire flow is a rare slow path, and the manual flush is only needed on CPUs that lack coherency for encrypted memory. Fixes: 19a23da53932 ("Fix unsynchronized access to sev members through svm_register_enc_region") Reported-by: Gabe Kirkpatrick <gkirkpatrick@google.com> Cc: Josh Eads <josheads@google.com> Cc: Peter Gonda <pgonda@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20240217013430.2079561-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-22KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMUSean Christopherson
Advertise and support software-protected VMs if and only if the TDP MMU is enabled, i.e. disallow KVM_SW_PROTECTED_VM if TDP is enabled for KVM's legacy/shadow MMU. TDP support for the shadow MMU is maintenance-only, e.g. support for TDX and SNP will also be restricted to the TDP MMU. Fixes: 89ea60c2c7b5 ("KVM: x86: Add support for "protected VMs" that can utilize private memory") Link: https://lore.kernel.org/r/20240222190612.2942589-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIPSean Christopherson
Rewrite the help message for KVM_SW_PROTECTED_VM to make it clear that software-protected VMs are a development and testing vehicle for guest_memfd(), and that attempting to use KVM_SW_PROTECTED_VM for anything remotely resembling a "real" VM will fail. E.g. any memory accesses from KVM will incorrectly access shared memory, nested TDP is wildly broken, and so on and so forth. Update KVM's API documentation with similar warnings to discourage anyone from attempting to run anything but selftests with KVM_X86_SW_PROTECTED_VM. Fixes: 89ea60c2c7b5 ("KVM: x86: Add support for "protected VMs" that can utilize private memory") Link: https://lore.kernel.org/r/20240222190612.2942589-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22Merge tag 'net-6.8.0-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and netfilter. Current release - regressions: - af_unix: fix another unix GC hangup Previous releases - regressions: - core: fix a possible AF_UNIX deadlock - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready() - netfilter: nft_flow_offload: release dst in case direct xmit path is used - bridge: switchdev: ensure MDB events are delivered exactly once - l2tp: pass correct message length to ip6_append_data - dccp/tcp: unhash sk from ehash for tb2 alloc failure after check_estalblished() - tls: fixes for record type handling with PEEK - devlink: fix possible use-after-free and memory leaks in devlink_init() Previous releases - always broken: - bpf: fix an oops when attempting to read the vsyscall page through bpf_probe_read_kernel - sched: act_mirred: use the backlog for mirred ingress - netfilter: nft_flow_offload: fix dst refcount underflow - ipv6: sr: fix possible use-after-free and null-ptr-deref - mptcp: fix several data races - phonet: take correct lock to peek at the RX queue Misc: - handful of fixes and reliability improvements for selftests" * tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) l2tp: pass correct message length to ip6_append_data net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY selftests: ioam: refactoring to align with the fix Fix write to cloned skb in ipv6_hop_ioam() phonet/pep: fix racy skb_queue_empty() use phonet: take correct lock to peek at the RX queue net: sparx5: Add spinlock for frame transmission from CPU net/sched: flower: Add lock protection when remove filter handle devlink: fix port dump cmd type net: stmmac: Fix EST offset for dwmac 5.10 tools: ynl: don't leak mcast_groups on init error tools: ynl: make sure we always pass yarg to mnl_cb_run net: mctp: put sock on tag allocation failure netfilter: nf_tables: use kzalloc for hook allocation netfilter: nf_tables: register hooks last when adding new chain/flowtable netfilter: nft_flow_offload: release dst in case direct xmit path is used netfilter: nft_flow_offload: reset dst in route object after setting up flow netfilter: nf_tables: set dormant flag on hook register failure selftests: tls: add test for peeking past a record of a different type selftests: tls: add test for merging of same-type control messages ...
2024-02-22Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-02-22 The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 24 day(s) which contain a total of 15 files changed, 217 insertions(+), 17 deletions(-). The main changes are: 1) Fix a syzkaller-triggered oops when attempting to read the vsyscall page through bpf_probe_read_kernel and friends, from Hou Tao. 2) Fix a kernel panic due to uninitialized iter position pointer in bpf_iter_task, from Yafang Shao. 3) Fix a race between bpf_timer_cancel_and_free and bpf_timer_cancel, from Martin KaFai Lau. 4) Fix a xsk warning in skb_add_rx_frag() (under CONFIG_DEBUG_NET) due to incorrect truesize accounting, from Sebastian Andrzej Siewior. 5) Fix a NULL pointer dereference in sk_psock_verdict_data_ready, from Shigeru Yoshida. 6) Fix a resolve_btfids warning when bpf_cpumask symbol cannot be resolved, from Hari Bathini. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() selftests/bpf: Add negtive test cases for task iter bpf: Fix an issue due to uninitialized bpf_iter_task selftests/bpf: Test racing between bpf_timer_cancel_and_free and bpf_timer_cancel bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel selftest/bpf: Test the read of vsyscall page under x86-64 x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h bpf, scripts: Correct GPL license name xsk: Add truesize to skb_add_rx_frag(). bpf: Fix warning for bpf_cpumask in verifier ==================== Link: https://lore.kernel.org/r/20240221231826.1404-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-20Merge branch 'for-6.8/cxl-cper' into for-6.8/cxlDan Williams
Pick up CXL CPER notification removal for v6.8-rc6, to return in a later merge window.
2024-02-19KVM/VMX: Move VERW closer to VMentry for MDS mitigationPawan Gupta
During VMentry VERW is executed to mitigate MDS. After VERW, any memory access like register push onto stack may put host data in MDS affected CPU buffers. A guest can then use MDS to sample host data. Although likelihood of secrets surviving in registers at current VERW callsite is less, but it can't be ruled out. Harden the MDS mitigation by moving the VERW mitigation late in VMentry path. Note that VERW for MMIO Stale Data mitigation is unchanged because of the complexity of per-guest conditional VERW which is not easy to handle that late in asm with no GPRs available. If the CPU is also affected by MDS, VERW is unconditionally executed late in asm regardless of guest having MMIO access. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com
2024-02-19KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCHSean Christopherson
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus VMLAUNCH. Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF, for MDS mitigations as late as possible without needing to duplicate VERW for both paths. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-5-a6216d83edb7%40linux.intel.com
2024-02-19x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static keyPawan Gupta
The VERW mitigation at exit-to-user is enabled via a static branch mds_user_clear. This static branch is never toggled after boot, and can be safely replaced with an ALTERNATIVE() which is convenient to use in asm. Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user path. Also remove the now redundant VERW in exc_nmi() and arch_exit_to_user_mode(). Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
2024-02-19x86/entry_32: Add VERW just before userspace transitionPawan Gupta
As done for entry_64, add support for executing VERW late in exit to user path for 32-bit mode. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-3-a6216d83edb7%40linux.intel.com
2024-02-19x86/entry_64: Add VERW just before userspace transitionPawan Gupta
Mitigation for MDS is to use VERW instruction to clear any secrets in CPU Buffers. Any memory accesses after VERW execution can still remain in CPU buffers. It is safer to execute VERW late in return to user path to minimize the window in which kernel data can end up in CPU buffers. There are not many kernel secrets to be had after SWITCH_TO_USER_CR3. Add support for deploying VERW mitigation after user register state is restored. This helps minimize the chances of kernel data ending up into CPU buffers after executing VERW. Note that the mitigation at the new location is not yet enabled. Corner case not handled ======================= Interrupts returning to kernel don't clear CPUs buffers since the exit-to-user path is expected to do that anyways. But, there could be a case when an NMI is generated in kernel after the exit-to-user path has cleared the buffers. This case is not handled and NMI returning to kernel don't clear CPU buffers because: 1. It is rare to get an NMI after VERW, but before returning to userspace. 2. For an unprivileged user, there is no known way to make that NMI less rare or target it. 3. It would take a large number of these precisely-timed NMIs to mount an actual attack. There's presumably not enough bandwidth. 4. The NMI in question occurs after a VERW, i.e. when user state is restored and most interesting data is already scrubbed. Whats left is only the data that NMI touches, and that may or may not be of any interest. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-2-a6216d83edb7%40linux.intel.com
2024-02-19x86/bugs: Add asm helpers for executing VERWPawan Gupta
MDS mitigation requires clearing the CPU buffers before returning to user. This needs to be done late in the exit-to-user path. Current location of VERW leaves a possibility of kernel data ending up in CPU buffers for memory accesses done after VERW such as: 1. Kernel data accessed by an NMI between VERW and return-to-user can remain in CPU buffers since NMI returning to kernel does not execute VERW to clear CPU buffers. 2. Alyssa reported that after VERW is executed, CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system call. Memory accesses during stack scrubbing can move kernel stack contents into CPU buffers. 3. When caller saved registers are restored after a return from function executing VERW, the kernel stack accesses can remain in CPU buffers(since they occur after VERW). To fix this VERW needs to be moved very late in exit-to-user path. In preparation for moving VERW to entry/exit asm code, create macros that can be used in asm. Also make VERW patching depend on a new feature flag X86_FEATURE_CLEAR_CPU_BUF. Reported-by: Alyssa Milburn <alyssa.milburn@intel.com> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-1-a6216d83edb7%40linux.intel.com
2024-02-18Merge tag 'kbuild-fixes-v6.8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Reformat nested if-conditionals in Makefiles with 4 spaces - Fix CONFIG_DEBUG_INFO_BTF builds for big endian - Fix modpost for module srcversion - Fix an escape sequence warning in gen_compile_commands.py - Fix kallsyms to ignore ARMv4 thunk symbols * tag 'kbuild-fixes-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kallsyms: ignore ARMv4 thunks along with others modpost: trim leading spaces when processing source files list gen_compile_commands: fix invalid escape sequence warning kbuild: Fix changing ELF file type for output of gen_btf for big endian docs: kconfig: Fix grammar and formatting kbuild: use 4-space indentation when followed by conditionals
2024-02-18Merge tag 'x86_urgent_for_v6.8_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Use a GB page for identity mapping only when memory of this size is requested so that mapping of reserved regions is prevented which would otherwise lead to system crashes on UV machines * tag 'x86_urgent_for_v6.8_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
2024-02-16x86/numa: Fix the sort compare func used in numa_fill_memblks()Alison Schofield
The compare function used to sort memblks into starting address order fails when the result of its u64 address subtraction gets truncated to an int upon return. The impact of the bad sort is that memblks will be filled out incorrectly. Depending on the set of memblks, a user may see no errors at all but still have a bad fill, or see messages reporting a node overlap that leads to numa init failure: [] node 0 [mem: ] overlaps with node 1 [mem: ] [] No NUMA configuration found Replace with a comparison that can only result in: 1, 0, -1. Fixes: 8f012db27c95 ("x86/numa: Introduce numa_fill_memblks()") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/99dcb3ae87e04995e9f293f6158dc8fa0749a487.1705085543.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2024-02-16x86/numa: Fix the address overlap check in numa_fill_memblks()Alison Schofield
numa_fill_memblks() fills in the gaps in numa_meminfo memblks over a physical address range. To do so, it first creates a list of existing memblks that overlap that address range. The issue is that it is off by one when comparing to the end of the address range, so memblks that do not overlap are selected. The impact of selecting a memblk that does not actually overlap is that an existing memblk may be filled when the expected action is to do nothing and return NUMA_NO_MEMBLK to the caller. The caller can then add a new NUMA node and memblk. Replace the broken open-coded search for address overlap with the memblock helper memblock_addrs_overlap(). Update the kernel doc and in code comments. Suggested by: "Huang, Ying" <ying.huang@intel.com> Fixes: 8f012db27c95 ("x86/numa: Introduce numa_fill_memblks()") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/10a3e6109c34c21a8dd4c513cf63df63481a2b07.1705085543.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2024-02-16KVM: x86: Mark target gfn of emulated atomic instruction as dirtySean Christopherson
When emulating an atomic access on behalf of the guest, mark the target gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault. This fixes a bug where KVM effectively corrupts guest memory during live migration by writing to guest memory without informing userspace that the page is dirty. Marking the page dirty got unintentionally dropped when KVM's emulated CMPXCHG was converted to do a user access. Before that, KVM explicitly mapped the guest page into kernel memory, and marked the page dirty during the unmap phase. Mark the page dirty even if the CMPXCHG fails, as the old data is written back on failure, i.e. the page is still written. The value written is guaranteed to be the same because the operation is atomic, but KVM's ABI is that all writes are dirty logged regardless of the value written. And more importantly, that's what KVM did before the buggy commit. Huge kudos to the folks on the Cc list (and many others), who did all the actual work of triaging and debugging. Fixes: 1c2361f667f3 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Cc: Pasha Tatashin <tatashin@google.com> Cc: Michael Krebs <mkrebs@google.com> base-commit: 6769ea8da8a93ed4630f1ce64df6aafcaabfce64 Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240215010004.1456078-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "ARM: - Avoid dropping the page refcount twice when freeing an unlinked page-table subtree. - Don't source the VFIO Kconfig twice - Fix protected-mode locking order between kvm and vcpus RISC-V: - Fix steal-time related sparse warnings x86: - Cleanup gtod_is_based_on_tsc() to return "bool" instead of an "int" - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only if the incoming events->nmi.pending is non-zero. If the target vCPU is in the UNITIALIZED state, the spurious request will result in KVM exiting to userspace, which in turn causes QEMU to constantly acquire and release QEMU's global mutex, to the point where the BSP is unable to make forward progress. - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being incorrectly truncated, and ultimately causes KVM to think a fixed counter has already been disabled (KVM thinks the old value is '0'). - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace that is ultimately ignored due to ignore_msrs=true doesn't zero the output as intended. Selftests cleanups and fixes: - Remove redundant newlines from error messages. - Delete an unused variable in the AMX test (which causes build failures when compiling with -Werror). - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE, and the test eventually got skipped). - Fix TSC related bugs in several Hyper-V selftests. - Fix a bug in the dirty ring logging test where a sem_post() could be left pending across multiple runs, resulting in incorrect synchronization between the main thread and the vCPU worker thread. - Relax the dirty log split test's assertions on 4KiB mappings to fix false positives due to the number of mappings for memslot 0 (used for code and data that is NOT being dirty logged) changing, e.g. due to NUMA balancing" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: arm64: Fix double-free following kvm_pgtable_stage2_free_unlinked() RISC-V: KVM: Use correct restricted types RISC-V: paravirt: Use correct restricted types RISC-V: paravirt: steal_time should be static KVM: selftests: Don't assert on exact number of 4KiB in dirty log split test KVM: selftests: Fix a semaphore imbalance in the dirty ring logging test KVM: x86: Fix KVM_GET_MSRS stack info leak KVM: arm64: Do not source virt/lib/Kconfig twice KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl KVM: x86: Make gtod_is_based_on_tsc() return 'bool' KVM: selftests: Make hyperv_clock require TSC based system clocksource KVM: selftests: Run clocksource dependent tests with hyperv_clocksource_tsc_page too KVM: selftests: Use generic sys_clocksource_is_tsc() in vmx_nested_tsc_scaling_test KVM: selftests: Generalize check_clocksource() from kvm_clock_test KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu KVM: arm64: Fix circular locking dependency KVM: selftests: Fail tests when open() fails with !ENOENT KVM: selftests: Avoid infinite loop in hyperv_features when invtsc is missing KVM: selftests: Delete superfluous, unused "stage" variable in AMX test KVM: selftests: x86_64: Remove redundant newlines ...
2024-02-15x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()Hou Tao
When trying to use copy_from_kernel_nofault() to read vsyscall page through a bpf program, the following oops was reported: BUG: unable to handle page fault for address: ffffffffff600000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...... RIP: 0010:copy_from_kernel_nofault+0x6f/0x110 ...... Call Trace: <TASK> ? copy_from_kernel_nofault+0x6f/0x110 bpf_probe_read_kernel+0x1d/0x50 bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d trace_call_bpf+0xc5/0x1c0 perf_call_bpf_enter.isra.0+0x69/0xb0 perf_syscall_enter+0x13e/0x200 syscall_trace_enter+0x188/0x1c0 do_syscall_64+0xb5/0xe0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> ...... ---[ end trace 0000000000000000 ]--- The oops is triggered when: 1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall page and invokes copy_from_kernel_nofault() which in turn calls __get_user_asm(). 2) Because the vsyscall page address is not readable from kernel space, a page fault exception is triggered accordingly. 3) handle_page_fault() considers the vsyscall page address as a user space address instead of a kernel space address. This results in the fix-up setup by bpf not being applied and a page_fault_oops() is invoked due to SMAP. Considering handle_page_fault() has already considered the vsyscall page address as a userspace address, fix the problem by disallowing vsyscall page read for copy_from_kernel_nofault(). Originally-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com Reported-by: xingwei lee <xrivendell7@gmail.com> Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240202103935.3154011-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.hHou Tao
Move is_vsyscall_vaddr() into asm/vsyscall.h to make it available for copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c. Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240202103935.3154011-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15kbuild: use 4-space indentation when followed by conditionalsMasahiro Yamada
GNU Make manual [1] clearly forbids a tab at the beginning of the conditional directive line: "Extra spaces are allowed and ignored at the beginning of the conditional directive line, but a tab is not allowed." This will not work for the next release of GNU Make, hence commit 82175d1f9430 ("kbuild: Replace tabs with spaces when followed by conditionals") replaced the inappropriate tabs with 8 spaces. However, the 8-space indentation cannot be visually distinguished. Linus suggested 2-4 spaces for those nested if-statements. [2] This commit redoes the replacement with 4 spaces. [1]: https://www.gnu.org/software/make/manual/make.html#Conditional-Syntax [2]: https://lore.kernel.org/all/CAHk-=whJKZNZWsa-VNDKafS_VfY4a5dAjG-r8BZgWk_a-xSepw@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-14Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux ↵Paolo Bonzini
into HEAD KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8: - Remove redundant newlines from error messages. - Delete an unused variable in the AMX test (which causes build failures when compiling with -Werror). - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE, and the test eventually got skipped). - Fix TSC related bugs in several Hyper-V selftests. - Fix a bug in the dirty ring logging test where a sem_post() could be left pending across multiple runs, resulting in incorrect synchronization between the main thread and the vCPU worker thread. - Relax the dirty log split test's assertions on 4KiB mappings to fix false positives due to the number of mappings for memslot 0 (used for code and data that is NOT being dirty logged) changing, e.g. due to NUMA balancing. - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the function generates boolean values, and all callers treat the return value as a bool).
2024-02-14Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fixes for 6.8: - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only if the incoming events->nmi.pending is non-zero. If the target vCPU is in the UNITIALIZED state, the spurious request will result in KVM exiting to userspace, which in turn causes QEMU to constantly acquire and release QEMU's global mutex, to the point where the BSP is unable to make forward progress. - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being incorrectly truncated, and ultimately causes KVM to think a fixed counter has already been disabled (KVM thinks the old value is '0'). - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace that is ultimately ignored due to ignore_msrs=true doesn't zero the output as intended.
2024-02-12x86/mm/ident_map: Use gbpages only where full GB page should be mapped.Steve Wahl
When ident_pud_init() uses only gbpages to create identity maps, large ranges of addresses not actually requested can be included in the resulting table; a 4K request will map a full GB. On UV systems, this ends up including regions that will cause hardware to halt the system if accessed (these are marked "reserved" by BIOS). Even processor speculation into these regions is enough to trigger the system halt. Only use gbpages when map creation requests include the full GB page of space. Fall back to using smaller 2M pages when only portions of a GB page are included in the request. No attempt is made to coalesce mapping requests. If a request requires a map entry at the 2M (pmd) level, subsequent mapping requests within the same 1G region will also be at the pmd level, even if adjacent or overlapping such requests could have been combined to map a full gbpage. Existing usage starts with larger regions and then adds smaller regions, so this should not have any great consequence. [ dhansen: fix up comment formatting, simplifty changelog ] Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
2024-02-12x86/xen: Add some null pointer checking to smp.cKunwu Chan
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Ensure the allocation was successful by checking the pointer validity. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401161119.iof6BQsf-lkp@intel.com/ Suggested-by: Markus Elfring <Markus.Elfring@web.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240119094948.275390-1-chentao@kylinos.cn Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-11Merge tag 'x86_urgent_for_v6.8_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Correct the minimum CPU family for Transmeta Crusoe in Kconfig so that such hw can boot again - Do not take into accout XSTATE buffer size info supplied by userspace when constructing a sigreturn frame - Switch get_/put_user* to EX_TYPE_UACCESS exception handling when an MCE is encountered so that it can be properly recovered from instead of simply panicking * tag 'x86_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6 x86/fpu: Stop relying on userspace for info to fault in xsave buffer x86/lib: Revert to _ASM_EXTABLE_UA() for {get,put}_user() fixups
2024-02-09work around gcc bugs with 'asm goto' with outputsLinus Torvalds
We've had issues with gcc and 'asm goto' before, and we created a 'asm_volatile_goto()' macro for that in the past: see commits 3f0116c3238a ("compiler/gcc4: Add quirk for 'asm goto' miscompilation bug") and a9f180345f53 ("compiler/gcc4: Make quirk for asm_volatile_goto() unconditional"). Then, much later, we ended up removing the workaround in commit 43c249ea0b1e ("compiler-gcc.h: remove ancient workaround for gcc PR 58670") because we no longer supported building the kernel with the affected gcc versions, but we left the macro uses around. Now, Sean Christopherson reports a new version of a very similar problem, which is fixed by re-applying that ancient workaround. But the problem in question is limited to only the 'asm goto with outputs' cases, so instead of re-introducing the old workaround as-is, let's rename and limit the workaround to just that much less common case. It looks like there are at least two separate issues that all hit in this area: (a) some versions of gcc don't mark the asm goto as 'volatile' when it has outputs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420 which is easy to work around by just adding the 'volatile' by hand. (b) Internal compiler errors: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422 which are worked around by adding the extra empty 'asm' as a barrier, as in the original workaround. but the problem Sean sees may be a third thing since it involves bad code generation (not an ICE) even with the manually added 'volatile'. but the same old workaround works for this case, even if this feels a bit like voodoo programming and may only be hiding the issue. Reported-and-tested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/ Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andrew Pinski <quic_apinski@quicinc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-09Merge tag 'efi-fixes-for-v6.8-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "The only notable change here is the patch that changes the way we deal with spurious errors from the EFI memory attribute protocol. This will be backported to v6.6, and is intended to ensure that we will not paint ourselves into a corner when we tighten this further in order to comply with MS requirements on signed EFI code. Note that this protocol does not currently exist in x86 production systems in the field, only in Microsoft's fork of OVMF, but it will be mandatory for Windows logo certification for x86 PCs in the future. - Tighten ELF relocation checks on the RISC-V EFI stub - Give up if the new EFI memory attributes protocol fails spuriously on x86 - Take care not to place the kernel in the lowest 16 MB of DRAM on x86 - Omit special purpose EFI memory from memblock - Some fixes for the CXL CPER reporting code - Make the PE/COFF layout of mixed-mode capable images comply with a strict interpretation of the spec" * tag 'efi-fixes-for-v6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/efistub: Use 1:1 file:memory mapping for PE/COFF .compat section cxl/trace: Remove unnecessary memcpy's cxl/cper: Fix errant CPER prints for CXL events efi: Don't add memblocks for soft-reserved memory efi: runtime: Fix potential overflow of soft-reserved region size efi/libstub: Add one kernel-doc comment x86/efistub: Avoid placing the kernel below LOAD_PHYSICAL_ADDR x86/efistub: Give up if memory attribute protocol returns an error riscv/efistub: Tighten ELF relocation check riscv/efistub: Ensure GP-relative addressing is not used
2024-02-09x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6Aleksander Mazur
The kernel built with MCRUSOE is unbootable on Transmeta Crusoe. It shows the following error message: This kernel requires an i686 CPU, but only detected an i586 CPU. Unable to boot - please use a kernel appropriate for your CPU. Remove MCRUSOE from the condition introduced in commit in Fixes, effectively changing X86_MINIMUM_CPU_FAMILY back to 5 on that machine, which matches the CPU family given by CPUID. [ bp: Massage commit message. ] Fixes: 25d76ac88821 ("x86/Kconfig: Explicitly enumerate i686-class CPUs in Kconfig") Signed-off-by: Aleksander Mazur <deweloper@wp.pl> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20240123134309.1117782-1-deweloper@wp.pl
2024-02-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "x86 guest: - Avoid false positive for check that only matters on AMD processors x86: - Give a hint when Win2016 might fail to boot due to XSAVES && !XSAVEC configuration - Do not allow creating an in-kernel PIT unless an IOAPIC already exists RISC-V: - Allow ISA extensions that were enabled for bare metal in 6.8 (Zbc, scalar and vector crypto, Zfh[min], Zihintntl, Zvfh[min], Zfa) S390: - fix CC for successful PQAP instruction - fix a race when creating a shadow page" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: x86/coco: Define cc_vendor without CONFIG_ARCH_HAS_CC_PLATFORM x86/kvm: Fix SEV check in sev_map_percpu_data() KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum KVM: x86: Check irqchip mode before create PIT KVM: riscv: selftests: Add Zfa extension to get-reg-list test RISC-V: KVM: Allow Zfa extension for Guest/VM KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test RISC-V: KVM: Allow Zihintntl extension for Guest/VM KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM KVM: riscv: selftests: Add vector crypto extensions to get-reg-list test RISC-V: KVM: Allow vector crypto extensions for Guest/VM KVM: riscv: selftests: Add scaler crypto extensions to get-reg-list test RISC-V: KVM: Allow scalar crypto extensions for Guest/VM KVM: riscv: selftests: Add Zbc extension to get-reg-list test RISC-V: KVM: Allow Zbc extension for Guest/VM KVM: s390: fix cc for successful PQAP KVM: s390: vsie: fix race during shadow creation
2024-02-06x86/coco: Define cc_vendor without CONFIG_ARCH_HAS_CC_PLATFORMNathan Chancellor
After commit a9ef277488cf ("x86/kvm: Fix SEV check in sev_map_percpu_data()"), there is a build error when building x86_64_defconfig with GCOV using LLVM: ld.lld: error: undefined symbol: cc_vendor >>> referenced by kvm.c >>> arch/x86/kernel/kvm.o:(kvm_smp_prepare_boot_cpu) in archive vmlinux.a which corresponds to if (cc_vendor != CC_VENDOR_AMD || !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) return; Without GCOV, clang is able to eliminate the use of cc_vendor because cc_platform_has() evaluates to false when CONFIG_ARCH_HAS_CC_PLATFORM is not set, meaning that if statement will be true no matter what value cc_vendor has. With GCOV, the instrumentation keeps the use of cc_vendor around for code coverage purposes but cc_vendor is only declared, not defined, without CONFIG_ARCH_HAS_CC_PLATFORM, leading to the build error above. Provide a macro definition of cc_vendor when CONFIG_ARCH_HAS_CC_PLATFORM is not set with a value of CC_VENDOR_NONE, so that the first condition can always be evaluated/eliminated at compile time, avoiding the build error altogether. This is very similar to the situation prior to commit da86eb961184 ("x86/coco: Get rid of accessor functions"). Signed-off-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Message-Id: <20240202-provide-cc_vendor-without-arch_has_cc_platform-v1-1-09ad5f2a3099@kernel.org> Fixes: a9ef277488cf ("x86/kvm: Fix SEV check in sev_map_percpu_data()", 2024-01-31) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-05KVM: x86: Fix KVM_GET_MSRS stack info leakMathias Krause
Commit 6abe9c1386e5 ("KVM: X86: Move ignore_msrs handling upper the stack") changed the 'ignore_msrs' handling, including sanitizing return values to the caller. This was fine until commit 12bc2132b15e ("KVM: X86: Do the same ignore_msrs check for feature msrs") which allowed non-existing feature MSRs to be ignored, i.e. to not generate an error on the ioctl() level. It even tried to preserve the sanitization of the return value. However, the logic is flawed, as '*data' will be overwritten again with the uninitialized stack value of msr.data. Fix this by simplifying the logic and always initializing msr.data, vanishing the need for an additional error exit path. Fixes: 12bc2132b15e ("KVM: X86: Do the same ignore_msrs check for feature msrs") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240203124522.592778-2-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05x86/efistub: Use 1:1 file:memory mapping for PE/COFF .compat sectionArd Biesheuvel
The .compat section is a dummy PE section that contains the address of the 32-bit entrypoint of the 64-bit kernel image if it is bootable from 32-bit firmware (i.e., CONFIG_EFI_MIXED=y) This section is only 8 bytes in size and is only referenced from the loader, and so it is placed at the end of the memory view of the image, to avoid the need for padding it to 4k, which is required for sections appearing in the middle of the image. Unfortunately, this violates the PE/COFF spec, and even if most EFI loaders will work correctly (including the Tianocore reference implementation), PE loaders do exist that reject such images, on the basis that both the file and memory views of the file contents should be described by the section headers in a monotonically increasing manner without leaving any gaps. So reorganize the sections to avoid this issue. This results in a slight padding overhead (< 4k) which can be avoided if desired by disabling CONFIG_EFI_MIXED (which is only needed in rare cases these days) Fixes: 3e3eabe26dc8 ("x86/boot: Increase section and file alignment to 4k/512") Reported-by: Mike Beaton <mjsbeaton@gmail.com> Link: https://lkml.kernel.org/r/CAHzAAWQ6srV6LVNdmfbJhOwhBw5ZzxxZZ07aHt9oKkfYAdvuQQ%40mail.gmail.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-02-02KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrlMingwei Zhang
Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl when reprogramming fixed counters, as truncating the value results in KVM thinking fixed counter 2 is already disabled (the bug also affects fixed counters 3+, but KVM doesn't yet support those). As a result, if the guest disables fixed counter 2, KVM will get a false negative and fail to reprogram/disable emulation of the counter, which can leads to incorrect counts and spurious PMIs in the guest. Fixes: 76d287b2342e ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()") Cc: stable@vger.kernel.org Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com [sean: rewrite changelog to call out the effects of the bug] Signed-off-by: Sean Christopherson <seanjc@google.com>