summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2021-12-16Merge tag 'net-5.16-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from mac80211, wifi, bpf. Relatively large batches of fixes from BPF and the WiFi stack, calm in general networking. Current release - regressions: - dpaa2-eth: fix buffer overrun when reporting ethtool statistics Current release - new code bugs: - bpf: fix incorrect state pruning for <8B spill/fill - iavf: - add missing unlocks in iavf_watchdog_task() - do not override the adapter state in the watchdog task (again) - mlxsw: spectrum_router: consolidate MAC profiles when possible Previous releases - regressions: - mac80211 fixes: - rate control, avoid driver crash for retransmitted frames - regression in SSN handling of addba tx - a memory leak where sta_info is not freed - marking TX-during-stop for TX in in_reconfig, prevent stall - cfg80211: acquire wiphy mutex on regulatory work - wifi drivers: fix build regressions and LED config dependency - virtio_net: fix rx_drops stat for small pkts - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down() Previous releases - always broken: - bpf fixes: - kernel address leakage in atomic fetch - kernel address leakage in atomic cmpxchg's r0 aux reg - signed bounds propagation after mov32 - extable fixup offset - extable address check - mac80211: - fix the size used for building probe request - send ADDBA requests using the tid/queue of the aggregation session - agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid deadlocks - validate extended element ID is present - mptcp: - never allow the PM to close a listener subflow (null-defer) - clear 'kern' flag from fallback sockets, prevent crash - fix deadlock in __mptcp_push_pending() - inet_diag: fix kernel-infoleak for UDP sockets - xsk: do not sleep in poll() when need_wakeup set - smc: avoid very long waits in smc_release() - sch_ets: don't remove idle classes from the round-robin list - netdevsim: - zero-initialize memory for bpf map's value, prevent info leak - don't let user space overwrite read only (max) ethtool parms - ixgbe: set X550 MDIO speed before talking to PHY - stmmac: - fix null-deref in flower deletion w/ VLAN prio Rx steering - dwmac-rk: fix oob read in rk_gmac_setup - ice: time stamping fixes - systemport: add global locking for descriptor life cycle" * tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits) bpf, selftests: Fix racing issue in btf_skc_cls_ingress test selftest/bpf: Add a test that reads various addresses. bpf: Fix extable address check. bpf: Fix extable fixup offset. bpf, selftests: Add test case trying to taint map value pointer bpf: Make 32->64 bounds propagation slightly more robust bpf: Fix signed bounds propagation after mov32 sit: do not call ipip6_dev_free() from sit_init_net() net: systemport: Add global locking for descriptor lifecycle net/smc: Prevent smc_release() from long blocking net: Fix double 0x prefix print in SKB dump virtio_net: fix rx_drops stat for small pkts dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED sfc_ef100: potential dereference of null pointer net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup net: usb: lan78xx: add Allied Telesis AT29M2-AF net/packet: rx_owner_map depends on pg_vec netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc dpaa2-eth: fix ethtool statistics ixgbe: set X550 MDIO speed before talking to PHY ...
2021-12-16bpf: Fix extable address check.Alexei Starovoitov
The verifier checks that PTR_TO_BTF_ID pointer is either valid or NULL, but it cannot distinguish IS_ERR pointer from valid one. When offset is added to IS_ERR pointer it may become small positive value which is a user address that is not handled by extable logic and has to be checked for at the runtime. Tighten BPF_PROBE_MEM pointer check code to prevent this case. Fixes: 4c5de127598e ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Reported-by: Lorenzo Fontana <lorenzo.fontana@elastic.co> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-16bpf: Fix extable fixup offset.Alexei Starovoitov
The prog - start_of_ldx is the offset before the faulting ldx to the location after it, so this will be used to adjust pt_regs->ip for jumping over it and continuing, and with old temp it would have been fixed up to the wrong offset, causing crash. Fixes: 4c5de127598e ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-12-12Merge tag 'sched-urgent-2021-12-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single fix for the x86 scheduler topology: Using cluster topology on hybrid CPUs, e.g. Alder Lake, biases the scheduler towards the ATOM cluster as that has more total capacity. Use selection based on CPU priority instead" * tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched,x86: Don't use cluster topology for x86 hybrid CPUs
2021-12-10Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "More x86 fixes: - Logic bugs in CR0 writes and Hyper-V hypercalls - Don't use Enlightened MSR Bitmap for L3 - Remove user-triggerable WARN Plus a few selftest fixes and a regression test for the user-triggerable WARN" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: selftests: KVM: Add test to verify KVM doesn't explode on "bad" I/O KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode selftests: KVM: avoid failures due to reserved HyperTransport region KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall KVM: x86: selftests: svm_int_ctl_test: fix intercept calculation KVM: nVMX: Don't use Enlightened MSR Bitmap for L3
2021-12-10KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exitSean Christopherson
Replace a WARN with a comment to call out that userspace can modify RCX during an exit to userspace to handle string I/O. KVM doesn't actually support changing the rep count during an exit, i.e. the scenario can be ignored, but the WARN needs to go as it's trivial to trigger from userspace. Cc: stable@vger.kernel.org Fixes: 3b27de271839 ("KVM: x86: split the two parts of emulator_pio_in") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211025201311.1881846-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10KVM: X86: Raise #GP when clearing CR0_PG in 64 bit modeLai Jiangshan
In the SDM: If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an attempt to clear CR0.PG causes a general-protection exception (#GP). Software should transition to compatibility mode and clear CR4.PCIDE before attempting to disable paging. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211207095230.53437-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI reqSean Christopherson
Do not bail early if there are no bits set in the sparse banks for a non-sparse, a.k.a. "all CPUs", IPI request. Per the Hyper-V spec, it is legal to have a variable length of '0', e.g. VP_SET's BankContents in this case, if the request can be serviced without the extra info. It is possible that for a given invocation of a hypercall that does accept variable sized input headers that all the header input fits entirely within the fixed size header. In such cases the variable sized input header is zero-sized and the corresponding bits in the hypercall input should be set to zero. Bailing early results in KVM failing to send IPIs to all CPUs as expected by the guest. Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-10KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush ↵Vitaly Kuznetsov
hypercall Prior to commit 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH | KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs and KVM_REQ_TLB_FLUSH is/was defined as: (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that "This call guarantees that by the time control returns back to the caller, the observable effects of all flushes on the specified virtual processors have occurred." and without KVM_REQUEST_WAIT there's a small chance that the vCPU making the TLB flush will resume running before all IPIs get delivered to other vCPUs and a stale mapping can get read there. Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST: kvm_hv_flush_tlb() is the sole caller which uses it for kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where KVM_REQUEST_WAIT makes a difference. Cc: stable@kernel.org Fixes: 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211209102937.584397-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08sched,x86: Don't use cluster topology for x86 hybrid CPUsPeter Zijlstra
For x86 hybrid CPUs like Alder Lake, the order of CPU selection should be based strictly on CPU priority. Don't include cluster topology for hybrid CPUs to avoid interference with such CPU selection order. On Alder Lake, the Atom CPU cluster has more capacity (4 Atom CPUs) vs Big core cluster (2 hyperthread CPUs). This could potentially bias CPU selection towards Atom over Big Core, when Big core CPU has higher priority. Fixes: 66558b730f25 ("sched: Add cluster scheduler level for x86") Suggested-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Link: https://lkml.kernel.org/r/20211204091402.GM16608@worktop.programming.kicks-ass.net
2021-12-08KVM: nVMX: Don't use Enlightened MSR Bitmap for L3Vitaly Kuznetsov
When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from clean fields to make Hyper-V aware of the change. For KVM's L1s, this is done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr(). MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't check if the resulting bitmap is different and never cleans HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and may result in Hyper-V missing the update. The issue could've been solved by calling evmcs_touch_msr_bitmap() for eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so would not give any performance benefits (compared to not using Enlightened MSR Bitmap at all). 3-level nesting is also not a very common setup nowadays. Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for now. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-06Merge tag 'efi-urgent-for-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: "Ensure that the EFI memory map resides in encrypted memory even after it has been reallocated" * tag 'efi-urgent-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/sme: Explicitly map new EFI memmap table as encrypted
2021-12-05Merge tag 'x86_urgent_for_v5.16_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Fix a couple of SWAPGS fencing issues in the x86 entry code - Use the proper operand types in __{get,put}_user() to prevent truncation in SEV-ES string io - Make sure the kernel mappings are present in trampoline_pgd in order to prevent any potential accesses to unmapped memory after switching to it - Fix a trivial list corruption in objtool's pv_ops validation - Disable the clocksource watchdog for TSC on platforms which claim that the TSC is constant, doesn't stop in sleep states, CPU has TSC adjust and the number of sockets of the platform are max 2, to prevent erroneous markings of the TSC as unstable. - Make sure TSC adjust is always checked not only when going idle - Prevent a stack leak by initializing struct _fpx_sw_bytes properly in the FPU code - Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention * tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/xen: Add xenpv_restore_regs_and_return_to_usermode() x86/entry: Use the correct fence macro after swapgs in kernel CR3 x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry() x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword x86/64/mm: Map all kernel memory into trampoline_pgd objtool: Fix pv_ops noinstr validation x86/tsc: Disable clocksource watchdog for TSC on qualified platorms x86/tsc: Add a timer to make sure TSC_adjust is always checked x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog() x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define
2021-12-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull more kvm fixes from Paolo Bonzini: - Static analysis fix - New SEV-ES protocol for communicating invalid VMGEXIT requests - Ensure APICv is considered inactive if there is no APIC - Fix reserved bits for AMD PerfEvtSeln register * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails KVM: x86/mmu: Retry page fault if root is invalidated by memslot update KVM: VMX: Set failure code in prepare_vmcs02() KVM: ensure APICv is considered inactive if there is no APIC KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
2021-12-05x86/sme: Explicitly map new EFI memmap table as encryptedTom Lendacky
Reserving memory using efi_mem_reserve() calls into the x86 efi_arch_mem_reserve() function. This function will insert a new EFI memory descriptor into the EFI memory map representing the area of memory to be reserved and marking it as EFI runtime memory. As part of adding this new entry, a new EFI memory map is allocated and mapped. The mapping is where a problem can occur. This new memory map is mapped using early_memremap() and generally mapped encrypted, unless the new memory for the mapping happens to come from an area of memory that is marked as EFI_BOOT_SERVICES_DATA memory. In this case, the new memory will be mapped unencrypted. However, during replacement of the old memory map, efi_mem_type() is disabled, so the new memory map will now be long-term mapped encrypted (in efi.memmap), resulting in the map containing invalid data and causing the kernel boot to crash. Since it is known that the area will be mapped encrypted going forward, explicitly map the new memory map as encrypted using early_memremap_prot(). Cc: <stable@vger.kernel.org> # 4.14.x Fixes: 8f716c9b5feb ("x86/mm: Add support to access boot related data in the clear") Link: https://lore.kernel.org/all/ebf1eb2940405438a09d51d121ec0d02c8755558.1634752931.git.thomas.lendacky@amd.com/ Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> [ardb: incorporate Kconfig fix by Arnd] Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2021-12-05KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failureTom Lendacky
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT exit code or exit parameters fails. The VMGEXIT instruction can be issued from userspace, even though userspace (likely) can't update the GHCB. To prevent userspace from being able to kill the guest, return an error through the GHCB when validation fails rather than terminating the guest. For cases where the GHCB can't be updated (e.g. the GHCB can't be mapped, etc.), just return back to the guest. The new error codes are documented in the lasest update to the GHCB specification. Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessarySean Christopherson
Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so that KVM falls back to __vmalloc() if physically contiguous memory isn't available. The buffer is purely a KVM software construct, i.e. there's no need for it to be physically contiguous. Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109222350.2266045-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05KVM: SEV: Return appropriate error codes if SEV-ES scratch setup failsSean Christopherson
Return appropriate error codes if setting up the GHCB scratch area for an SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM when allocating the kernel buffer could be confusing as userspace would likely suspect a guest issue. Fixes: 8f423a80d299 ("KVM: SVM: Support MMIO for an SEV-ES guest") Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109222350.2266045-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-03x86/xen: Add xenpv_restore_regs_and_return_to_usermode()Lai Jiangshan
In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the trampoline stack. But XEN pv doesn't use trampoline stack, so PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is also the kernel stack. In that case, source and destination stacks are identical, which means that reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv would cause %rsp to move up to the top of the kernel stack and leave the IRET frame below %rsp. This is dangerous as it can be corrupted if #NMI / #MC hit as either of these events occurring in the middle of the stack pushing would clobber data on the (original) stack. And, with XEN pv, swapgs_restore_regs_and_return_to_usermode() pushing the IRET frame on to the original address is useless and error-prone when there is any future attempt to modify the code. [ bp: Massage commit message. ] Fixes: 7f2590a110b8 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lkml.kernel.org/r/20211126101209.8613-4-jiangshanlai@gmail.com
2021-12-03x86/entry: Use the correct fence macro after swapgs in kernel CR3Lai Jiangshan
The commit c75890700455 ("x86/entry/64: Remove unneeded kernel CR3 switching") removed a CR3 write in the faulting path of load_gs_index(). But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI is enabled, see spectre_v1_select_mitigation(). Rather, it depended on the serializing CR3 write of SWITCH_TO_KERNEL_CR3 and since it got removed, add a FENCE_SWAPGS_KERNEL_ENTRY call to make sure speculation is blocked. [ bp: Massage commit message and comment. ] Fixes: c75890700455 ("x86/entry/64: Remove unneeded kernel CR3 switching") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211126101209.8613-3-jiangshanlai@gmail.com
2021-12-03x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()Lai Jiangshan
Commit 18ec54fdd6d18 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations") added FENCE_SWAPGS_{KERNEL|USER}_ENTRY for conditional SWAPGS. In paranoid_entry(), it uses only FENCE_SWAPGS_KERNEL_ENTRY for both branches. This is because the fence is required for both cases since the CR3 write is conditional even when PTI is enabled. But 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry") changed the order of SWAPGS and the CR3 write. And it missed the needed FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case. Add it back by changing the branches so that FENCE_SWAPGS_KERNEL_ENTRY can cover both branches. [ bp: Massage, fix typos, remove obsolete comment while at it. ] Fixes: 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211126101209.8613-2-jiangshanlai@gmail.com
2021-12-03x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qwordMichael Sterritt
Properly type the operands being passed to __put_user()/__get_user(). Otherwise, these routines truncate data for dependent instructions (e.g., INSW) and only read/write one byte. This has been tested by sending a string with REP OUTSW to a port and then reading it back in with REP INSW on the same port. Previous behavior was to only send and receive the first char of the size. For example, word operations for "abcd" would only read/write "ac". With change, the full string is now written and read back. Fixes: f980f9c31a923 (x86/sev-es: Compile early handler code into kernel image) Signed-off-by: Michael Sterritt <sterritt@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com
2021-12-03x86/64/mm: Map all kernel memory into trampoline_pgdJoerg Roedel
The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff range of kernel memory (with 4-level paging). This range contains the kernel's text+data+bss mappings and the module mapping space but not the direct mapping and the vmalloc area. This is enough to get the application processors out of real-mode, but for code that switches back to real-mode the trampoline_pgd is missing important parts of the address space. For example, consider this code from arch/x86/kernel/reboot.c, function machine_real_restart() for a 64-bit kernel: #ifdef CONFIG_X86_32 load_cr3(initial_page_table); #else write_cr3(real_mode_header->trampoline_pgd); /* Exiting long mode will fail if CR4.PCIDE is set. */ if (boot_cpu_has(X86_FEATURE_PCID)) cr4_clear_bits(X86_CR4_PCIDE); #endif /* Jump to the identity-mapped low memory code */ #ifdef CONFIG_X86_32 asm volatile("jmpl *%0" : : "rm" (real_mode_header->machine_real_restart_asm), "a" (type)); #else asm volatile("ljmpl *%0" : : "m" (real_mode_header->machine_real_restart_asm), "D" (type)); #endif The code switches to the trampoline_pgd, which unmaps the direct mapping and also the kernel stack. The call to cr4_clear_bits() will find no stack and crash the machine. The real_mode_header pointer below points into the direct mapping, and dereferencing it also causes a crash. The reason this does not crash always is only that kernel mappings are global and the CR3 switch does not flush those mappings. But if theses mappings are not in the TLB already, the above code will crash before it can jump to the real-mode stub. Extend the trampoline_pgd to contain all kernel mappings to prevent these crashes and to make code which runs on this page-table more robust. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org
2021-12-02KVM: x86/mmu: Retry page fault if root is invalidated by memslot updateSean Christopherson
Bail from the page fault handler if the root shadow page was obsoleted by a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU doesn't rely on the memslot/MMU generation, and instead relies on the root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes mmu_lock for write. For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has moved past the gfn associated with the SP. For other MMUs, the resulting behavior is far more convoluted, though unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete root isn't directly problematic, as the obsolete root will be unloaded and dropped before the vCPU re-enters the guest. But because the legacy MMU tracks shadow pages by their role, any SP created by the fault can can be reused in the new post-reload root. Again, that _shouldn't_ be problematic as any leaf child SPTEs will be created for the current/valid memslot generation, and kvm_mmu_get_page() will not reuse child SPs from the old generation as they will be flagged as obsolete. But, given that continuing with the fault is pointess (the root will be unloaded), apply the check to all MMUs. Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02KVM: VMX: Set failure code in prepare_vmcs02()Dan Carpenter
The error paths in the prepare_vmcs02() function are supposed to set *entry_failure_code but this path does not. It leads to using an uninitialized variable in the caller. Fixes: 71f7347025bf ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Message-Id: <20211130125337.GB24578@kili> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02KVM: ensure APICv is considered inactive if there is no APICPaolo Bonzini
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel local APIC, however kvm_apicv_activated might still be true if there are no reasons to disable APICv; in fact it is quite likely that there is none because APICv is inhibited by specific configurations of the local APIC and those configurations cannot be programmed. This triggers a WARN: WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu)); To avoid this, introduce another cause for APICv inhibition, namely the absence of an in-kernel local APIC. This cause is enabled by default, and is dropped by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. Reported-by: Ignat Korchagin <ignat@cloudflare.com> Fixes: ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22) Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Ignat Korchagin <ignat@cloudflare.com> Message-Id: <20211130123746.293379-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln registerLike Xu
If we run the following perf command in an AMD Milan guest: perf stat \ -e cpu/event=0x1d0/ \ -e cpu/event=0x1c7/ \ -e cpu/umask=0x1f,event=0x18e/ \ -e cpu/umask=0x7,event=0x18e/ \ -e cpu/umask=0x18,event=0x18e/ \ ./workload dmesg will report a #GP warning from an unchecked MSR access error on MSR_F15H_PERF_CTLx. This is because according to APM (Revision: 4.03) Figure 13-7, the bits [35:32] of AMD PerfEvtSeln register is a part of the event select encoding, which extends the EVENT_SELECT field from 8 bits to 12 bits. Opportunistically update pmu->reserved_bits for reserved bit 19. Reported-by: Jim Mattson <jmattson@google.com> Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20211118130320.95997-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02x86/tsc: Disable clocksource watchdog for TSC on qualified platormsFeng Tang
There are cases that the TSC clocksource is wrongly judged as unstable by the clocksource watchdog mechanism which tries to validate the TSC against HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to check the validity of a watchdog, Thomas Gleixner proposed [1]: "I'm inclined to lift that requirement when the CPU has: 1) X86_FEATURE_CONSTANT_TSC 2) X86_FEATURE_NONSTOP_TSC 3) X86_FEATURE_NONSTOP_TSC_S3 4) X86_FEATURE_TSC_ADJUST 5) At max. 4 sockets After two decades of horrors we're finally at a point where TSC seems to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST was really key as we can now detect even small modifications reliably and the important point is that we can cure them as well (not pretty but better than all other options)." As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive to use maximal 2 sockets. The check is done inside tsc_init() before registering 'tsc-early' and 'tsc' clocksources, as there were cases that both of them had been wrongly judged as unreliable. For more background of tsc/watchdog, there is a good summary in [2] [tglx} Update vs. jiffies: On systems where the only remaining clocksource aside of TSC is jiffies there is no way to make this work because that creates a circular dependency. Jiffies accuracy depends on not missing a periodic timer interrupt, which is not guaranteed. That could be detected by TSC, but as TSC is not trusted this cannot be compensated. The consequence is a circulus vitiosus which results in shutting down TSC and falling back to the jiffies clocksource which is even more unreliable. [1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/ [2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/ [ tglx: Refine comment and amend changelog ] Fixes: 6e3cd95234dc ("x86/hpet: Use another crystalball to evaluate HPET usability") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com
2021-12-02x86/tsc: Add a timer to make sure TSC_adjust is always checkedFeng Tang
The TSC_ADJUST register is checked every time a CPU enters idle state, but Thomas Gleixner mentioned there is still a caveat that a system won't enter idle [1], either because it's too busy or configured purposely to not enter idle. Setup a periodic timer (every 10 minutes) to make sure the check is happening on a regular base. [1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/ Fixes: 6e3cd95234dc ("x86/hpet: Use another crystalball to evaluate HPET usability") Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com
2021-11-30x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog()Marco Elver
save_sw_bytes() did not fully initialize sw_bytes, which caused KMSAN to report an infoleak (see below). Initialize sw_bytes explicitly to avoid this. KMSAN report follows: ===================================================== BUG: KMSAN: kernel-infoleak in instrument_copy_to_user ./include/linux/instrumented.h:121 BUG: KMSAN: kernel-infoleak in __copy_to_user ./include/linux/uaccess.h:154 BUG: KMSAN: kernel-infoleak in save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127 instrument_copy_to_user ./include/linux/instrumented.h:121 __copy_to_user ./include/linux/uaccess.h:154 save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245 get_sigframe+0x656/0x7e0 arch/x86/kernel/signal.c:296 __setup_rt_frame+0x14d/0x2a60 arch/x86/kernel/signal.c:471 setup_rt_frame arch/x86/kernel/signal.c:781 handle_signal arch/x86/kernel/signal.c:825 arch_do_signal_or_restart+0x417/0xdd0 arch/x86/kernel/signal.c:870 handle_signal_work kernel/entry/common.c:149 exit_to_user_mode_loop+0x1f6/0x490 kernel/entry/common.c:173 exit_to_user_mode_prepare kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 syscall_exit_to_user_mode+0x7e/0xc0 kernel/entry/common.c:302 do_syscall_64+0x60/0xd0 arch/x86/entry/common.c:88 entry_SYSCALL_64_after_hwframe+0x44/0xae ??:? Local variable sw_bytes created at: save_xstate_epilog+0x80/0x510 arch/x86/kernel/fpu/signal.c:121 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245 Bytes 20-47 of 48 are uninitialized Memory access of size 48 starts at ffff8880801d3a18 Data copied to user address 00007ffd90e2ef50 ===================================================== Link: https://lore.kernel.org/all/CAG_fn=V9T6OKPonSjsi9PmWB0hMHFC=yawozdft8i1-MSxrv=w@mail.gmail.com/ Fixes: 53599b4d54b9b8dd ("x86/fpu/signal: Prepare for variable sigframe length") Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lkml.kernel.org/r/20211126124746.761278-1-glider@google.com
2021-11-30x86/cpu: Drop spurious underscore from RAPTOR_LAKE #defineTony Luck
Convention for all the other "lake" CPUs is all one word. So s/RAPTOR_LAKE/RAPTORLAKE/ Fixes: fbdb5e8f2926 ("x86/cpu: Add Raptor Lake to Intel family") Reported-by: Rui Zhang <rui.zhang@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211119170832.1034220-1-tony.luck@intel.com
2021-11-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix constant sign extension affecting TCR_EL2 and preventing running on ARMv8.7 models due to spurious bits being set - Fix use of helpers using PSTATE early on exit by always sampling it as soon as the exit takes place - Move pkvm's 32bit handling into a common helper RISC-V: - Fix incorrect KVM_MAX_VCPUS value - Unmap stage2 mapping when deleting/moving a memslot x86: - Fix and downgrade BUG_ON due to uninitialized cache - Many APICv and MOVE_ENC_CONTEXT_FROM fixes - Correctly emulate TLB flushes around nested vmentry/vmexit and when the nested hypervisor uses VPID - Prevent modifications to CPUID after the VM has run - Other smaller bugfixes Generic: - Memslot handling bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits) KVM: fix avic_set_running for preemptable kernels KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled KVM: SEV: accept signals in sev_lock_two_vms KVM: SEV: do not take kvm->lock when destroying KVM: SEV: Prohibit migration of a VM that has mirrors KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM KVM: SEV: initialize regions_list of a mirror VM KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM KVM: SEV: do not use list_replace_init on an empty list KVM: x86: Use a stable condition around all VT-d PI paths KVM: x86: check PIR even for vCPUs with disabled APICv KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem KVM: x86/mmu: Handle "default" period when selectively waking kthread KVM: MMU: shadow nested paging does not have PKU KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg() ...
2021-11-30KVM: fix avic_set_running for preemptable kernelsPaolo Bonzini
avic_set_running() passes the current CPU to avic_vcpu_load(), albeit via vcpu->cpu rather than smp_processor_id(). If the thread is migrated while avic_set_running runs, the call to avic_vcpu_load() can use a stale value for the processor id. Avoid this by blocking preemption over the entire execution of avic_set_running(). Reported-by: Sean Christopherson <seanjc@google.com> Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC") Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabledPaolo Bonzini
There is nothing to synchronize if APICv is disabled, since neither other vCPUs nor assigned devices can set PIR.ON. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: accept signals in sev_lock_two_vmsPaolo Bonzini
Generally, kvm->lock is not taken for a long time, but sev_lock_two_vms is different: it takes vCPU locks inside, so userspace can hold it back just by calling a vCPU ioctl. Play it safe and use mutex_lock_killable. Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: do not take kvm->lock when destroyingPaolo Bonzini
Taking the lock is useless since there are no other references, and there are already accesses (e.g. to sev->enc_context_owner) that do not take it. So get rid of it. Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: Prohibit migration of a VM that has mirrorsPaolo Bonzini
VMs that mirror an encryption context rely on the owner to keep the ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM would cause a dangling ASID: 1. copy context from A to B (gets ref to A) 2. move context from A to L (moves ASID from A to L) 3. close L (releases ASID from L, B still references it) The right way to do the handoff instead is to create a fresh mirror VM on the destination first: 1. copy context from A to B (gets ref to A) [later] 2. close B (releases ref to A) 3. move context from A to L (moves ASID from A to L) 4. copy context from L to M So, catch the situation by adding a count of how many VMs are mirroring this one's encryption context. Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration") Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs lockedPaolo Bonzini
Now that we have a facility to lock two VMs with deadlock protection, use it for the creation of mirror VMs as well. One of COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst) would always fail, so the combination is nonsensical and it is okay to return -EBUSY if it is attempted. This sidesteps the question of what happens if a VM is MOVE_ENC_CONTEXT_FROM'd at the same time as it is COPY_ENC_CONTEXT_FROM'd: the locking prevents that from happening. Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROMPaolo Bonzini
Allow intra-host migration of a mirror VM; the destination VM will be a mirror of the same ASID as the source. Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: initialize regions_list of a mirror VMPaolo Bonzini
This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM, but technically harmless because the region list was unused for a mirror VM. However, it is untidy and it now causes a NULL pointer access when attempting to move the encryption context of a mirror VM. Fixes: 54526d1fd593 ("KVM: x86: Support KVM VMs sharing SEV context") Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROMPaolo Bonzini
Encapsulate the handling of the migration_in_progress flag for both VMs in two functions sev_lock_two_vms and sev_unlock_two_vms. It does not matter if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit later, and this change 1) keeps the cleanup chain of labels smaller 2) makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic. Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: SEV: do not use list_replace_init on an empty listPaolo Bonzini
list_replace_init cannot be used if the source is an empty list, because "new->next->prev = new" will overwrite "old->next": new old prev = new, next = new prev = old, next = old new->next = old->next prev = new, next = old prev = old, next = old new->next->prev = new prev = new, next = old prev = old, next = new new->prev = old->prev prev = old, next = old prev = old, next = old new->next->prev = new prev = old, next = old prev = new, next = new The desired outcome instead would be to leave both old and new the same as they were (two empty circular lists). Use list_cut_before, which already has the necessary check and is documented to discard the previous contents of the list that will hold the result. Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: x86: Use a stable condition around all VT-d PI pathsPaolo Bonzini
Currently, checks for whether VT-d PI can be used refer to the current status of the feature in the current vCPU; or they more or less pick vCPU 0 in case a specific vCPU is not available. However, these checks do not attempt to synchronize with changes to the IRTE. In particular, there is no path that updates the IRTE when APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU that has APICv disabled, if the wakeup occurs because of an IRTE that points to a posted interrupt. To fix this, always go through the VT-d PI path as long as there are assigned devices and APICv is available on both the host and the VM side. Since the relevant condition was copied over three times, take the hint and factor it into a separate function. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: x86: check PIR even for vCPUs with disabled APICvPaolo Bonzini
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even if APICv is disabled on the vCPU that receives it. In that case, the interrupt will just cause a vmexit and leave the ON bit set together with the PIR bit corresponding to the interrupt. Right now, the interrupt would not be delivered until APICv is re-enabled. However, fixing this is just a matter of always doing the PIR->IRR synchronization, even if the vCPU has temporarily disabled APICv. This is not a problem for performance, or if anything it is an improvement. First, in the common case where vcpu->arch.apicv_active is true, one fewer check has to be performed. Second, static_call_cond will elide the function call if APICv is not present or disabled. Finally, in the case for AMD hardware we can remove the sync_pir_to_irr callback: it is only needed for apic_has_interrupt_for_ppr, and that function already has a fallback for !APICv. Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: VMX: prepare sync_pir_to_irr for running with APICv disabledPaolo Bonzini
If APICv is disabled for this vCPU, assigned devices may still attempt to post interrupts. In that case, we need to cancel the vmentry and deliver the interrupt with KVM_REQ_EVENT. Extend the existing code that handles injection of L1 interrupts into L2 to cover this case as well. vmx_hwapic_irr_update is only called when APICv is active so it would be confusing to add a check for vcpu->arch.apicv_active in there. Instead, just use vmx_set_rvi directly in vmx_sync_pir_to_irr. Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: x86/mmu: Handle "default" period when selectively waking kthreadSean Christopherson
Account for the '0' being a default, "let KVM choose" period, when determining whether or not the recovery worker needs to be awakened in response to userspace reducing the period. Failure to do so results in the worker not being awakened properly, e.g. when changing the period from '0' to any small-ish value. Fixes: 4dfe4f40d845 ("kvm: x86: mmu: Make NX huge page recovery period configurable") Cc: stable@vger.kernel.org Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120015706.3830341-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: MMU: shadow nested paging does not have PKUPaolo Bonzini
Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding incorrect interpretations of the nested hypervisor's page tables. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible pathSean Christopherson
Drop the "flush" param and return values to/from the TDP MMU's helper for zapping collapsible SPTEs. Because the helper runs with mmu_lock held for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic zap handles the necessary remote TLB flush. Similarly, because mmu_lock is dropped and re-acquired between zapping legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must handle remote TLB flushes from the legacy MMU before calling into the TDP MMU. Fixes: e2209710ccc5d ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmappingSean Christopherson
Use the yield-safe variant of the TDP MMU iterator when handling an unmapping event from the MMU notifier, as most occurences of the event allow yielding. Fixes: e1eed5847b09 ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120015008.3780032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-28Merge tag 'x86-urgent-2021-11-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build fix from Thomas Gleixner: "A single fix for a missing __init annotation of prepare_command_line()" * tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Mark prepare_command_line() __init