summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-05-16x86/fpu/xstate: Restore supervisor states for signal returnYu-cheng Yu
The signal return fast path directly restores user states from the user buffer. Once that succeeds, restore supervisor states (but only when they are not yet restored). For the slow path, save supervisor states to preserve them across context switches, and restore after the user states are restored. The previous version has the overhead of an XSAVES in both the fast and the slow paths. It is addressed as the following: - In the fast path, only do an XRSTORS. - In the slow path, do a supervisor-state-only XSAVES, and relocate the buffer contents. Some thoughts in the implementation: - In the slow path, can any supervisor state become stale between save/restore? Answer: set_thread_flag(TIF_NEED_FPU_LOAD) protects the xstate buffer. - In the slow path, can any code reference a stale supervisor state register between save/restore? Answer: In the current lazy-restore scheme, any reference to xstate registers needs fpregs_lock()/fpregs_unlock() and __fpregs_load_activate(). - Are there other options? One other option is eagerly restoring all supervisor states. Currently, CET user-mode states and ENQCMD's PASID do not need to be eagerly restored. The upcoming CET kernel-mode states (24 bytes) need to be eagerly restored. To me, eagerly restoring all supervisor states adds more overhead then benefit at this point. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200512145444.15483-11-yu-cheng.yu@intel.com
2020-05-16x86/fpu/xstate: Preserve supervisor states for the slow path in ↵Yu-cheng Yu
__fpu__restore_sig() The signal return code is responsible for taking an XSAVE buffer present in user memory and loading it into the hardware registers. This operation only affects user XSAVE state and never affects supervisor state. The fast path through this code simply points XRSTOR directly at the user buffer. However, since user memory is not guaranteed to be always mapped, this XRSTOR can fail. If it fails, the signal return code falls back to a slow path which can tolerate page faults. That slow path copies the xfeatures one by one out of the user buffer into the task's fpu state area. However, by being in a context where it can handle page faults, the code can also schedule. The lazy-fpu-load code would think it has an up-to-date fpstate and would fail to save the supervisor state when scheduling the task out. When scheduling back in, it would likely restore stale supervisor state. To fix that, preserve supervisor state before the slow path. Modify copy_user_to_fpregs_zeroing() so that if it fails, fpregs are not zeroed, and there is no need for fpregs_deactivate() and supervisor states are preserved. Move set_thread_flag(TIF_NEED_FPU_LOAD) to the slow path. Without doing this, the fast path also needs supervisor states to be saved first. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200512145444.15483-10-yu-cheng.yu@intel.com
2020-05-16x86/fpu: Introduce copy_supervisor_to_kernel()Yu-cheng Yu
The XSAVES instruction takes a mask and saves only the features specified in that mask. The kernel normally specifies that all features be saved. XSAVES also unconditionally uses the "compacted format" which means that all specified features are saved next to each other in memory. If a feature is removed from the mask, all the features after it will "move up" into earlier locations in the buffer. Introduce copy_supervisor_to_kernel(), which saves only supervisor states and then moves those states into the standard location where they are normally found. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200512145444.15483-9-yu-cheng.yu@intel.com
2020-05-16x86/nmi: Remove edac.h include leftoverBorislav Petkov
... which db47d5f85646 ("x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMI") forgot to remove. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200515182246.3553-1-bp@alien8.de
2020-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Move the bpf verifier trace check into the new switch statement in HEAD. Resolve the overlapping changes in hinic, where bug fixes overlap the addition of VF support. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix sk_psock reference count leak on receive, from Xiyu Yang. 2) CONFIG_HNS should be invisible, from Geert Uytterhoeven. 3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this, from Maciej Żenczykowski. 4) ipv4 route redirect backoff wasn't actually enforced, from Paolo Abeni. 5) Fix netprio cgroup v2 leak, from Zefan Li. 6) Fix infinite loop on rmmod in conntrack, from Florian Westphal. 7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet. 8) Various bpf probe handling fixes, from Daniel Borkmann. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits) selftests: mptcp: pm: rm the right tmp file dpaa2-eth: properly handle buffer size restrictions bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range bpf: Restrict bpf_probe_read{, str}() only to archs where they work MAINTAINERS: Mark networking drivers as Maintained. ipmr: Add lockdep expression to ipmr_for_each_table macro ipmr: Fix RCU list debugging warning drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810 tcp: fix error recovery in tcp_zerocopy_receive() MAINTAINERS: Add Jakub to networking drivers. MAINTAINERS: another add of Karsten Graul for S390 networking drivers: ipa: fix typos for ipa_smp2p structure doc pppoe: only process PADT targeted at local interfaces selftests/bpf: Enforce returning 0 for fentry/fexit programs bpf: Enforce returning 0 for fentry/fexit progs net: stmmac: fix num_por initialization security: Fix the default value of secid_to_secctx hook libbpf: Fix register naming in PT_REGS s390 macros ...
2020-05-15x86/PCI: Mark Intel C620 MROMs as having non-compliant BARsXiaochun Lee
The Intel C620 Platform Controller Hub has MROM functions that have non-PCI registers (undocumented in the public spec) where BAR 0 is supposed to be, which results in messages like this: pci 0000:00:11.0: [Firmware Bug]: reg 0x30: invalid BAR (can't size) Mark these MROM functions as having non-compliant BARs so we don't try to probe any of them. There are no other BARs on these devices. See the Intel C620 Series Chipset Platform Controller Hub Datasheet, May 2019, Document Number 336067-007US, sec 2.1, 35.5, 35.6. [bhelgaas: commit log, add 0xa26d] Link: https://lore.kernel.org/r/1589513467-17070-1-git-send-email-lixiaochun.2888@163.com Signed-off-by: Xiaochun Lee <lixc17@lenovo.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org
2020-05-15KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mceJim Mattson
Bank_num is a one-based count of banks, not a zero-based index. It overflows the allocated space only when strictly greater than KVM_MAX_MCE_BANKS. Fixes: a9e38c3e01ad ("KVM: x86: Catch potential overrun in MCE setup") Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200511225616.19557-1-jmattson@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: add halt-polling cpu usage statsDavid Matlack
Two new stats for exposing halt-polling cpu usage: halt_poll_success_ns halt_poll_fail_ns Thus sum of these 2 stats is the total cpu time spent polling. "success" means the VCPU polled until a virtual interrupt was delivered. "fail" means the VCPU had to schedule out (either because the maximum poll time was reached or it needed to yield the CPU). To avoid touching every arch's kvm_vcpu_stat struct, only update and export halt-polling cpu usage stats if we're on x86. Exporting cpu usage as a u64 and in nanoseconds means we will overflow at ~500 years, which seems reasonably large. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200508182240.68440-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Migrate the VMX-preemption timerJim Mattson
The hrtimer used to emulate the VMX-preemption timer must be pinned to the same logical processor as the vCPU thread to be interrupted if we want to have any hope of adhering to the architectural specification of the VMX-preemption timer. Even with this change, the emulated VMX-preemption timer VM-exit occasionally arrives too late. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Change emulated VMX-preemption timer hrtimer to absoluteJim Mattson
Prepare for migration of this hrtimer, by changing it from relative to absolute. (I couldn't get migration to work with a relative timer.) Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-3-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Really make emulated nested preemption timer pinnedJim Mattson
The PINNED bit is ignored by hrtimer_init. It is only considered when starting the timer. When the hrtimer isn't pinned to the same logical processor as the vCPU thread to be interrupted, the emulated VMX-preemption timer often fails to adhere to the architectural specification. Fixes: f15a75eedc18e ("KVM: nVMX: make emulated nested preemption timer pinned") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Remove unused 'ops' param from nested_vmx_hardware_setup()Sean Christopherson
Remove a 'struct kvm_x86_ops' param that got left behind when the nested ops were moved to their own struct. Fixes: 33b22172452f0 ("KVM: x86: move nested-related kvm_x86_ops to a separate struct") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506204653.14683-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: SVM: Remove unnecessary V_IRQ unsettingSuravee Suthikulpanit
This has already been handled in the prior call to svm_clear_vintr(). Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-5-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: SVM: Merge svm_enable_vintr into svm_set_vintrSuravee Suthikulpanit
Code clean up and remove unnecessary intercept check for INTERCEPT_VINTR. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-4-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: VMX: Handle preemption timer fastpathWanpeng Li
This patch implements a fastpath for the preemption timer vmexit. The vmexit can be handled quickly so it can be performed with interrupts off and going back directly to the guest. Testing on SKX Server. cyclictest in guest(w/o mwait exposed, adaptive advance lapic timer is default -1): 5540.5ns -> 4602ns 17% kvm-unit-test/vmexit.flat: w/o avanced timer: tscdeadline_immed: 3028.5 -> 2494.75 17.6% tscdeadline: 5765.7 -> 5285 8.3% w/ adaptive advance timer default -1: tscdeadline_immed: 3123.75 -> 2583 17.3% tscdeadline: 4663.75 -> 4537 2.7% Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-8-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: TSCDEADLINE MSR emulation fastpathWanpeng Li
This patch implements a fast path for emulation of writes to the TSCDEADLINE MSR. Besides shortcutting various housekeeping tasks in the vCPU loop, the fast path can also deliver the timer interrupt directly without going through KVM_REQ_PENDING_TIMER because it runs in vCPU context. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-7-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86: introduce kvm_can_use_hv_timerPaolo Bonzini
Replace the ad hoc test in vmx_set_hv_timer with a test in the caller, start_hv_timer. This test is not Intel-specific and would be duplicated when introducing the fast path for the TSC deadline MSR. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: VMX: Optimize posted-interrupt delivery for timer fastpathWanpeng Li
While optimizing posted-interrupt delivery especially for the timer fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt() to introduce substantial latency because the processor has to perform all vmentry tasks, ack the posted interrupt notification vector, read the posted-interrupt descriptor etc. This is not only slow, it is also unnecessary when delivering an interrupt to the current CPU (as is the case for the LAPIC timer) because PIR->IRR and IRR->RVI synchronization is already performed on vmentry Therefore skip kvm_vcpu_trigger_posted_interrupt in this case, and instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST fastpath as well. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-6-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce more exit_fastpath_completion enum valuesWanpeng Li
Adds a fastpath_t typedef since enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values. - EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop, but it would skip invoking the exit handler. - EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered without invoking the exit handler or going back to vcpu_run Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce kvm_vcpu_exit_request() helperWanpeng Li
Introduce kvm_vcpu_exit_request() helper, we need to check some conditions before enter guest again immediately, we skip invoking the exit handler and go through full run loop if complete fastpath but there is stuff preventing we enter guest again immediately. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86: Print symbolic names of VMX VM-Exit flags in tracesSean Christopherson
Use __print_flags() to display the names of VMX flags in VM-Exit traces and strip the flags when printing the basic exit reason, e.g. so that a failed VM-Entry due to invalid guest state gets recorded as "INVALID_STATE FAILED_VMENTRY" instead of "0x80000021". Opportunstically fix misaligned variables in the kvm_exit and kvm_nested_vmexit_inject tracepoints. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200508235348.19427-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: VMX: Introduce generic fastpath handlerWanpeng Li
Introduce generic fastpath handler to handle MSR fastpath, VMX-preemption timer fastpath etc; move it after vmx_complete_interrupts() in order to catch events delivered to the guest, and abort the fast path in later patches. While at it, move the kvm_exit tracepoint so that it is printed for fastpath vmexits as well. There is no observed performance effect for the IPI fastpath after this patch. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <1588055009-12677-2-git-send-email-wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Drop superfluous VMREAD of vmcs02.GUEST_SYSENTER_*Sean Christopherson
Don't propagate GUEST_SYSENTER_* from vmcs02 to vmcs12 on nested VM-Exit as the vmcs12 fields are updated in vmx_set_msr(), and writes to the corresponding MSRs are always intercepted by KVM when running L2. Dropping the propagation was intended to be done in the same commit that added vmcs12 writes in vmx_set_msr()[1], but for reasons unknown was only shuffled around[2][3]. [1] https://patchwork.kernel.org/patch/10933215 [2] https://patchwork.kernel.org/patch/10933215/#22682289 [3] https://lore.kernel.org/patchwork/patch/1088643 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Truncate writes to vmcs.SYSENTER_EIP/ESP for 32-bit vCPUSean Christopherson
Explicitly truncate the data written to vmcs.SYSENTER_EIP/ESP on WRMSR if the virtual CPU doesn't support 64-bit mode. The SYSENTER address fields in the VMCS are natural width, i.e. bits 63:32 are dropped if the CPU doesn't support Intel 64 architectures. This behavior is visible to the guest after a VM-Exit/VM-Exit roundtrip, e.g. if the guest sets bits 63:32 in the actual MSR. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: VMX: Improve handle_external_interrupt_irqoff inline assemblyUros Bizjak
Improve handle_external_interrupt_irqoff inline assembly in several ways: - remove unneeded %c operand modifiers and "$" prefixes - use %rsp instead of _ASM_SP, since we are in CONFIG_X86_64 part - use $-16 immediate to align %rsp - remove unneeded use of __ASM_SIZE macro - define "ss" named operand only for X86_64 The patch introduces no functional changes. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200504155706.2516956-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Sanity check on gfn before removalPeter Xu
The index returned by kvm_async_pf_gfn_slot() will be removed when an async pf gfn is going to be removed. However kvm_async_pf_gfn_slot() is not reliable in that it can return the last key it loops over even if the gfn is not found in the async gfn array. It should never happen, but it's still better to sanity check against that to make sure no unexpected gfn will be removed. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155910.267514-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Force ASYNC_PF_PER_VCPU to be power of twoPeter Xu
Forcing the ASYNC_PF_PER_VCPU to be power of two is much easier to be used rather than calling roundup_pow_of_two() from time to time. Do this by adding a BUILD_BUG_ON() inside the hash function. Another point is that generally async pf does not allow concurrency over ASYNC_PF_PER_VCPU after all (see kvm_setup_async_pf()), so it does not make much sense either to have it not a power of two or some of the entries will definitely be wasted. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155859.267366-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: VMX: Remove unneeded __ASM_SIZE usage with POP instructionUros Bizjak
POP [mem] defaults to the word size, and the only legal non-default size is 16 bits, e.g. a 32-bit POP will #UD in 64-bit mode and vice versa, no need to use __ASM_SIZE macro to force operating mode. Changes since v1: - Fix commit message. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200427205035.1594232-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Add a helper to consolidate root sp allocationSean Christopherson
Add a helper, mmu_alloc_root(), to consolidate the allocation of a root shadow page, which has the same basic mechanics for all flavors of TDP and shadow paging. Note, __pa(sp->spt) doesn't need to be protected by mmu_lock, sp->spt points at a kernel page. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428023714.31923-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enumsSean Christopherson
Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Move max hugepage level to a separate #defineSean Christopherson
Rename PT_MAX_HUGEPAGE_LEVEL to KVM_MAX_HUGEPAGE_LEVEL and make it a separate define in anticipation of dropping KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Tweak PSE hugepage handling to avoid 2M vs 4M conundrumSean Christopherson
Change the PSE hugepage handling in walk_addr_generic() to fire on any page level greater than PT_PAGE_TABLE_LEVEL, a.k.a. PG_LEVEL_4K. PSE paging only has two levels, so "== 2" and "> 1" are functionally the same, i.e. this is a nop. A future patch will drop KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums, at which point "walker->level == PG_LEVEL_2M" is semantically incorrect (though still functionally ok). No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: x86: Cleanup vcpu->arch.guest_xstate_sizeXiaoyao Li
vcpu->arch.guest_xstate_size lost its only user since commit df1daba7d1cb ("KVM: x86: support XSAVES usage in the host"), so clean it up. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200429154312.1411-1-xiaoyao.li@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Tweak handling of failure code for nested VM-Enter failureSean Christopherson
Use an enum for passing around the failure code for a failed VM-Enter that results in VM-Exit to provide a level of indirection from the final resting place of the failure code, vmcs.EXIT_QUALIFICATION. The exit qualification field is an unsigned long, e.g. passing around 'u32 exit_qual' throws up red flags as it suggests KVM may be dropping bits when reporting errors to L1. This is a red herring because the only defined failure codes are 0, 2, 3, and 4, i.e. don't come remotely close to overflowing a u32. Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated by the MSR load list, which returns the (1-based) entry that failed, and the number of MSRs to load is a 32-bit VMCS field. At first blush, it would appear that overflowing a u32 is possible, but the number of MSRs that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC). In other words, there are two completely disparate types of data that eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is an 'unsigned long' in nature. This was presumably the reasoning for switching to 'u32' when the related code was refactored in commit ca0bde28f2ed6 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()"). Using an enum for the failure code addresses the technically-possible- but-will-never-happen scenario where Intel defines a failure code that doesn't fit in a 32-bit integer. The enum variables and values will either be automatically sized (gcc 5.4 behavior) or be subjected to some combination of truncation. The former case will simply work, while the latter will trigger a compile-time warning unless the compiler is being particularly unhelpful. Separating the failure code from the failed MSR entry allows for disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the conundrum where KVM has to choose between 'u32 exit_qual' and tracking values as 'unsigned long' that have no business being tracked as such. To cement the split, set vmcs12->exit_qualification directly from the entry error code or failed MSR index instead of bouncing through a local variable. Opportunistically rename the variables in load_vmcs12_host_state() and vmx_set_nested_state() to call out that they're ignored, set exit_reason on demand on nested VM-Enter failure, and add a comment in nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap. No functional change intended. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15bpf: Restrict bpf_probe_read{, str}() only to archs where they workDaniel Borkmann
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs with overlapping address ranges, we should really take the next step to disable them from BPF use there. To generally fix the situation, we've recently added new helper variants bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str(). For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,kernel}_str helpers"). Given bpf_probe_read{,str}() have been around for ~5 years by now, there are plenty of users at least on x86 still relying on them today, so we cannot remove them entirely w/o breaking the BPF tracing ecosystem. However, their use should be restricted to archs with non-overlapping address ranges where they are working in their current form. Therefore, move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and have x86, arm64, arm select it (other archs supporting it can follow-up on it as well). For the remaining archs, they can workaround easily by relying on the feature probe from bpftool which spills out defines that can be used out of BPF C code to implement the drop-in replacement for old/new kernels via: bpftool feature probe macro Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
2020-05-15x86: Fix early boot crash on gcc-10, third tryBorislav Petkov
... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-05-15x86/unwind/orc: Fix error handling in __unwind_start()Josh Poimboeuf
The unwind_state 'error' field is used to inform the reliable unwinding code that the stack trace can't be trusted. Set this field for all errors in __unwind_start(). Also, move the zeroing out of the unwind_state struct to before the ORC table initialization check, to prevent the caller from reading uninitialized data if the ORC table is corrupted. Fixes: af085d9084b4 ("stacktrace/x86: add function for detecting reliable stack traces") Fixes: d3a09104018c ("x86/unwinder/orc: Dont bail on stack overflow") Fixes: 98d0c8ebf77e ("x86/unwind/orc: Prevent unwinding before ORC initialization") Reported-by: Pavel Machek <pavel@denx.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/d6ac7215a84ca92b895fdd2e1aa546729417e6e6.1589487277.git.jpoimboe@redhat.com
2020-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-05-14 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON. 2) support for narrow loads in bpf_sock_addr progs and additional helpers in cg-skb progs, from Andrey. 3) bpf benchmark runner, from Andrii. 4) arm and riscv JIT optimizations, from Luke. 5) bpf iterator infrastructure, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14Merge tag 'trace-v5.7-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing fixes from Steven Rostedt: "Various tracing fixes: - Fix a crash when having function tracing and function stack tracing on the command line. The ftrace trampolines are created as executable and read only. But the stack tracer tries to modify them with text_poke() which expects all kernel text to still be writable at boot. Keep the trampolines writable at boot, and convert them to read-only with the rest of the kernel. - A selftest was triggering in the ring buffer iterator code, that is no longer valid with the update of keeping the ring buffer writable while a iterator is reading. Just bail after three failed attempts to get an event and remove the warning and disabling of the ring buffer. - While modifying the ring buffer code, decided to remove all the unnecessary BUG() calls" * tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Remove all BUG() calls ring-buffer: Don't deactivate the ring buffer on failed iterator reads x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
2020-05-14x86/fpu/xstate: Update copy_kernel_to_xregs_err() for supervisor statesYu-cheng Yu
The function copy_kernel_to_xregs_err() uses XRSTOR which can work with standard or compacted format without supervisor xstates. However, when supervisor xstates are present, XRSTORS must be used. Fix it by using XRSTORS when supervisor state handling is enabled. I also considered if there were additional cases where XRSTOR might be mistakenly called instead of XRSTORS. There are only three XRSTOR sites in the kernel: 1. copy_kernel_to_xregs_booting(), already switches between XRSTOR and XRSTORS based on X86_FEATURE_XSAVES. 2. copy_user_to_xregs(), which *needs* XRSTOR because it is copying from userspace and must never copy supervisor state with XRSTORS. 3. copy_kernel_to_xregs_err() mistakenly used XRSTOR only. Fix it. [ bp: Massage commit message. ] Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200512145444.15483-8-yu-cheng.yu@intel.com
2020-05-14vfs: add faccessat2 syscallMiklos Szeredi
POSIX defines faccessat() as having a fourth "flags" argument, while the linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken. Add a new faccessat(2) syscall with the added flags argument and implement both flags. The value of AT_EACCESS is defined in glibc headers to be the same as AT_REMOVEDIR. Use this value for the kernel interface as well, together with the explanatory comment. Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can be useful and is trivial to implement. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14x86/boot: Mark global variables as staticArvind Sankar
Mike Lothian reports that after commit 964124a97b97 ("efi/x86: Remove extra headroom for setup block") gcc 10.1.0 fails with HOSTCC arch/x86/boot/tools/build /usr/lib/gcc/x86_64-pc-linux-gnu/10.1.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: linker defined: multiple definition of '_end' /usr/lib/gcc/x86_64-pc-linux-gnu/10.1.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccEkW0jM.o: previous definition here collect2: error: ld returned 1 exit status make[1]: *** [scripts/Makefile.host:103: arch/x86/boot/tools/build] Error 1 make: *** [arch/x86/Makefile:303: bzImage] Error 2 The issue is with the _end variable that was added, to hold the end of the compressed kernel from zoffsets.h (ZO__end). The name clashes with the linker-defined _end symbol that indicates the end of the build program itself. Even when there is no compile-time error, this causes build to use memory past the end of its .bss section. To solve this, mark _end as static, and for symmetry, mark the rest of the variables that keep track of symbols from the compressed kernel as static as well. Fixes: 964124a97b97 ("efi/x86: Remove extra headroom for setup block") Reported-by: Mike Lothian <mike@fireburn.co.uk> Tested-by: Mike Lothian <mike@fireburn.co.uk> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20200511225849.1311869-1-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-05-13x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstatesYu-cheng Yu
The function sanitize_restored_xstate() sanitizes user xstates of an XSAVE buffer by clearing bits not in the input 'xfeatures' from the buffer's header->xfeatures, effectively resetting those features back to the init state. When supervisor xstates are introduced, it is necessary to make sure only user xstates are sanitized. Ensure supervisor bits in header->xfeatures stay set and supervisor states are not modified. To make names clear, also: - Rename the function to sanitize_restored_user_xstate(). - Rename input parameter 'xfeatures' to 'user_xfeatures'. - In __fpu__restore_sig(), rename 'xfeatures' to 'user_xfeatures'. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200512145444.15483-7-yu-cheng.yu@intel.com
2020-05-13KVM: x86/mmu: Capture TDP level when updating CPUIDSean Christopherson
Snapshot the TDP level now that it's invariant (SVM) or dependent only on host capabilities and guest CPUID (VMX). This avoids having to call kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or calculating the page role, and thus avoids the associated retpoline. Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is active is legal, if dodgy. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: VMX: Move nested EPT out of kvm_x86_ops.get_tdp_level() hookSean Christopherson
Separate the "core" TDP level handling from the nested EPT path to make it clear that kvm_x86_ops.get_tdp_level() is used if and only if nested EPT is not in use (kvm_init_shadow_ept_mmu() calculates the level from the passed in vmcs12->eptp). Add a WARN_ON() to enforce that the kvm_x86_ops hook is not called for nested EPT. This sets the stage for snapshotting the non-"nested EPT" TDP page level during kvm_cpuid_update() to avoid the retpoline associated with kvm_x86_ops.get_tdp_level() when resetting the MMU, a relatively frequent operation when running a nested guest. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: VMX: Add proper cache tracking for CR0Sean Christopherson
Move CR0 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs in the (uncommon) case where kvm_read_cr0() is called multiple times in a single VM-Exit, and more importantly eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR0, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr0_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: VMX: Add proper cache tracking for CR4Sean Christopherson
Move CR4 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs and retpolines (when configured) during nested VMX transitions as kvm_read_cr4_bits() is invoked multiple times on each transition, e.g. when stuffing CR0 and CR3. As an added bonus, this eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR4, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr4_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Unconditionally validate CR3 during nested transitionsSean Christopherson
Unconditionally check the validity of the incoming CR3 during nested VM-Enter/VM-Exit to avoid invoking kvm_read_cr3() in the common case where the guest isn't using PAE paging. If vmcs.GUEST_CR3 hasn't yet been cached (common case), kvm_read_cr3() will trigger a VMREAD. The VMREAD (~30 cycles) alone is likely slower than nested_cr3_valid() (~5 cycles if vcpu->arch.maxphyaddr gets a cache hit), and the poor exchange only gets worse when retpolines are enabled as the call to kvm_x86_ops.cache_reg() will incur a retpoline (60+ cycles). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Save L1 TSC offset in 'struct kvm_vcpu_arch'Sean Christopherson
Save L1's TSC offset in 'struct kvm_vcpu_arch' and drop the kvm_x86_ops hook read_l1_tsc_offset(). This avoids a retpoline (when configured) when reading L1's effective TSC, which is done at least once on every VM-Exit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>