path: root/arch/x86
2020-07-08  KVM: x86/mmu: Make .write_log_dirty a nested operation  (Sean Christopherson)
Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it from the non-nested dirty log hooks, and because it's a nested-only operation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08  KVM: nVMX: WARN if PML emulation helper is invoked outside of nested guest  (Sean Christopherson)
WARN if vmx_write_pml_buffer() is called outside of guest mode instead of silently ignoring the condition. The only caller is nested EPT's ept_update_accessed_dirty_bits(), which should only be reachable when L2 is active. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
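A minimal sketch of the guard described above (illustrative only; the helper's exact signature and return value may differ from the actual patch):

	static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa)
	{
		/* PML emulation only makes sense while L2 is running. */
		if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
			return 0;

		/* ... emulate PML by logging the GPA into vmcs12's PML buffer ... */
		return 0;
	}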
2020-07-08  KVM: x86/mmu: Drop kvm_arch_write_log_dirty() wrapper  (Sean Christopherson)
Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty() directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually used for x86 functions that are invoked from generic KVM, and implies that there are external callers, neither of which is true. Remove the check for a non-NULL kvm_x86_ops hook, as the call is compiled only for PTTYPE_EPT and the hook is unconditionally set by VMX. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08  KVM: async_pf: change kvm_setup_async_pf()/kvm_arch_setup_async_pf() return type to bool  (Vitaly Kuznetsov)
Unlike normal 'int' functions returning '0' on success, kvm_setup_async_pf()/kvm_arch_setup_async_pf() return '1' when a job to handle the page fault asynchronously was scheduled and '0' otherwise. To avoid confusion, change the return type to 'bool'. No functional change intended. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200615121334.91300-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
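A hedged sketch of the resulting prototypes (parameter lists are illustrative of the kernel of that era and may differ in detail):

	/* Returns true if an async page fault job was queued, false otherwise. */
	bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
				     gfn_t gfn);
	bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
				unsigned long hva, struct kvm_arch_async_pf *arch);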
2020-07-08  KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault()  (Vitaly Kuznetsov)
KVM guest code in Linux enables APF only when KVM_FEATURE_ASYNC_PF_INT is supported; this means we will never see KVM_PV_REASON_PAGE_READY when handling a page fault vmexit in KVM. While at it, make sure we only follow the genuine page fault path when the APF reason is zero. If we happen to see something else, it means that the underlying hypervisor is misbehaving. Leave WARN_ON_ONCE() to catch that. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08  Merge branch 'kvm-master' into HEAD  (Paolo Bonzini)
Merge 5.8-rc bugfixes.
2020-07-08  Merge branch 'kvm-async-pf-int' into HEAD  (Paolo Bonzini)
2020-07-03  KVM: VMX: Use KVM_POSSIBLE_CR*_GUEST_BITS to initialize guest/host masks  (Sean Christopherson)
Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks are effectively VMX specific even if they're not named as such. This avoids duplicate code, better documents the guest owned CR0 bit, and eliminates the need for a build-time assertion to keep VMX and x86 synchronized. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-03  KVM: x86: Mark CR4.TSD as being possibly owned by the guest  (Sean Christopherson)
Mark CR4.TSD as being possibly owned by the guest as that is indeed the case on VMX. Without TSD being tagged as possibly owned by the guest, a targeted read of CR4 to get TSD could observe a stale value. This bug is benign in the current code base as the sole consumer of TSD is the emulator (for RDTSC) and the emulator always "reads" the entirety of CR4 when grabbing bits. Add a build-time assertion to ensure VMX doesn't hand over more CR4 bits without also updating x86. Fixes: 52ce3c21aec3 ("x86,kvm,vmx: Don't trap writes to CR4.TSD") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
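A hedged sketch of the idea (bit list illustrative, not exhaustive; the follow-up KVM_POSSIBLE_CR*_GUEST_BITS cleanup above later removes the need for the assertion):

	/* Common x86 side: CR4.TSD may be guest-owned. */
	#define KVM_POSSIBLE_CR4_GUEST_BITS				\
		(X86_CR4_TSD | X86_CR4_DE | X86_CR4_PCE | X86_CR4_PGE	\
		 | /* ... other guest-ownable bits ... */ 0)

	/* Somewhere in VMX init code: fail the build if VMX hands over more. */
	BUILD_BUG_ON(KVM_CR4_GUEST_OWNED_BITS & ~KVM_POSSIBLE_CR4_GUEST_BITS);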
2020-07-03  KVM: x86: Inject #GP if guest attempts to toggle CR4.LA57 in 64-bit mode  (Sean Christopherson)
Inject a #GP on MOV CR4 if CR4.LA57 is toggled in 64-bit mode, which is illegal per Intel's SDM: CR4.LA57 57-bit linear addresses (bit 12 of CR4) ... blah blah blah ... This bit cannot be modified in IA-32e mode. Note, the pseudocode for MOV CR doesn't call out the fault condition, which is likely why the check was missed during initial development. This is arguably an SDM bug and will hopefully be fixed in a future release of the SDM. Fixes: fd8cb433734ee ("KVM: MMU: Expose the LA57 feature to VM.") Cc: stable@vger.kernel.org Reported-by: Sebastien Boeuf <sebastien.boeuf@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703021714.5549-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
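A hedged sketch of the added check in kvm_set_cr4() (a sketch of the intent, not necessarily the literal patch):

	unsigned long old_cr4 = kvm_read_cr4(vcpu);

	if (is_long_mode(vcpu) && ((cr4 ^ old_cr4) & X86_CR4_LA57))
		return 1;	/* caller injects #GP */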
2020-06-30  KVM: x86: bit 8 of non-leaf PDPEs is not reserved  (Paolo Bonzini)
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf page table entries. Intel ignores it; AMD ignores it in PDEs and PDPEs, but reserves it in PML4Es. Probably, earlier versions of the AMD manual documented it as reserved in PDPEs as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it. Cc: stable@vger.kernel.org Reported-by: Nadav Amit <namit@vmware.com> Fixes: a0c0feb57992 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-29  KVM: X86: Fix async pf caused null-ptr-deref  (Wanpeng Li)
Syzbot reported that:

	CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0
	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
	RIP: 0010:__apic_accept_irq+0x46/0xb80
	Call Trace:
	 kvm_arch_async_page_present+0x7de/0x9e0
	 kvm_check_async_pf_completion+0x18d/0x400
	 kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0
	 kvm_vcpu_ioctl+0x46a/0xe20
	 ksys_ioctl+0x11a/0x180
	 __x64_sys_ioctl+0x6f/0xb0
	 do_syscall_64+0xf6/0x7d0
	 entry_SYSCALL_64_after_hwframe+0x49/0xb3

The testcase enables the APF mechanism in MSR_KVM_ASYNC_PF_EN with ASYNC_PF_INT enabled without setting MSR_KVM_ASYNC_PF_INT first. What's worse, interrupt-based APF 'page ready' event delivery depends on an in-kernel LAPIC, yet we didn't bail out when the LAPIC is not in the kernel while the guest sets MSR_KVM_ASYNC_PF_EN, which later causes the null-ptr-deref in the host. This patch fixes it. Reported-by: syzbot+1bf777dfdde86d64b89b@syzkaller.appspotmail.com Fixes: 2635b5c4a0 (KVM: x86: interrupt based APF 'page ready' event delivery) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1593426391-8231-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
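One way to express the missing check in the MSR_KVM_ASYNC_PF_EN write handler, as a hedged sketch (the real patch may place or phrase it differently):

	/* Interrupt-based 'page ready' delivery requires an in-kernel LAPIC. */
	if ((data & KVM_ASYNC_PF_DELIVERY_AS_INT) && !lapic_in_kernel(vcpu))
		return 1;	/* reject the WRMSR with #GP */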
2020-06-23  KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru  (Sean Christopherson)
Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was moved to common x86 code. No functional change intended. Fixes: 37486135d3a7b ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-23  KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling  (Marcelo Tosatti)
The Linux TSC calibration procedure is subject to small variations (it's common to see a +-1 kHz difference between reboots on a given CPU, for example). So migrating a guest between two hosts with identical processors can fail in case of a small variation in the calibrated TSC between them. Without TSC scaling, the current kernel interface will either return an error (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode. This change enables the following TSC tolerance check to accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default).

	/*
	 * Compute the variation in TSC rate which is acceptable
	 * within the range of tolerance and decide if the
	 * rate being applied is within that bounds of the hardware
	 * rate. If so, no scaling or compensation need be done.
	 */
	thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
	thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
	if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
		pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n",
			 user_tsc_khz, thresh_lo, thresh_hi);
		use_scaling = 1;
	}

The NTP daemon in the guest can correct this difference (NTP can correct up to 500 ppm). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20200616114741.GA298183@fuller.cnet> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-23  KVM: X86: Fix MSR range of APIC registers in X2APIC mode  (Xiaoyao Li)
Only MSR address range 0x800 through 0x8ff is architecturally reserved and dedicated for accessing APIC registers in x2APIC mode. Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
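A hedged sketch of the corrected range check (APIC_BASE_MSR is 0x800 in KVM's lapic code; the surrounding handler is elided and may differ):

	/* Only 0x800-0x8ff is the architectural x2APIC MSR range. */
	if (msr < APIC_BASE_MSR || msr > APIC_BASE_MSR + 0xff)
		return 1;	/* not an APIC register access */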
2020-06-22  KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL  (Sean Christopherson)
Remove support for context switching between the guest's and host's desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't required for correct functionality, e.g. KVM intercepts reads and writes to the MSR, and the latency effects of the settings controlled by the MSR are not architecturally visible.

As a general rule, KVM should not allow the guest to control power management settings unless explicitly enabled by userspace, e.g. see KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2 can improve the performance of SMT siblings. A devious guest could disable C0.2 so as to improve the performance of their workloads to the detriment of workloads running in the host or on other VMs.

Wholesale removal of UMWAIT_CONTROL context switching also fixes a race condition where updates from the host may cause KVM to enter the guest with the incorrect value. Because updates are propagated to all CPUs via IPI (SMP function callback), the value in hardware may be stale with respect to the cached value and KVM could enter the guest with the wrong value in hardware. As above, the guest can't observe the bad value, but it's a weird and confusing wart in the implementation.

Removal also fixes the unnecessary usage of VMX's atomic load/store MSR lists. Using the lists is only necessary for MSRs that are required for correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on old hardware, or for MSRs that need to-the-uop precision, e.g. perf related MSRs. For UMWAIT_CONTROL, the effects are only visible in the kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in vmx_vcpu_run(). Using the atomic lists is undesirable as they are more expensive than direct RDMSR/WRMSR.

Furthermore, even if giving the guest control of the MSR is legitimate, e.g. in pass-through scenarios, it's not clear that the benefits would outweigh the overhead. E.g. saving and restoring an MSR across a VMX roundtrip costs ~250 cycles, and if the guest diverged from the host that cost would be paid on every run of the guest. In other words, if there is a legitimate use case then it should be enabled by a new per-VM capability.

Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can correctly expose other WAITPKG features to the guest, e.g. TPAUSE, UMWAIT and UMONITOR. Fixes: 6e3ba4abcea56 ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Cc: Jingqi Liu <jingqi.liu@intel.com> Cc: Tao Xu <tao3.xu@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22  KVM: nVMX: Plumb L2 GPA through to PML emulation  (Sean Christopherson)
Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: c5f983f6e8455 ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22  KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic()  (Vitaly Kuznetsov)
translate_gpa() returns a GPA; assigning it to 'real_gfn' seems obviously wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and we don't use the value in 'real_gfn' as a GFN; we do real_gfn = gpa_to_gfn(real_gfn); instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust your eyes', but let's fix it for good. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200622151435.752560-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22  KVM: LAPIC: ensure APIC map is up to date on concurrent update requests  (Paolo Bonzini)
The following race can cause lost map update events:

	cpu1                                    cpu2

	apic_map_dirty = true
	------------------------------------------------------------
	kvm_recalculate_apic_map:
	    pass check
	        mutex_lock(&kvm->arch.apic_map_lock);
	        if (!kvm->arch.apic_map_dirty)
	    and in process of updating map
	-------------------------------------------------------------
	other calls to apic_map_dirty = true
	    might be too late for affected cpu
	-------------------------------------------------------------
	apic_map_dirty = false
	-------------------------------------------------------------
	kvm_recalculate_apic_map:
	    bail out on if (!kvm->arch.apic_map_dirty)

To fix it, record the beginning of an update of the APIC map in apic_map_dirty. If another APIC map change switches apic_map_dirty back to DIRTY during the update, kvm_recalculate_apic_map should not make it CLEAN, and the other caller will go through the slow path. Reported-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22  kvm: lapic: fix broken vcpu hotplug  (Igor Mammedov)
Guest fails to online a hotplugged CPU with the error

	smpboot: do_boot_cpu failed(-1) to wakeup CPU#4

It's caused by the fact that kvm_apic_set_state(), which used to call recalculate_apic_map() unconditionally and pulled the hotplugged CPU into the APIC map, now updates the map conditionally on state changes. In this case the APIC map is not considered dirty and the map is not updated. Fix the issue by forcing an unconditional update from kvm_apic_set_state(), like it used to be. Fixes: 4abaffce4d25a ("KVM: LAPIC: Recalculate apic map in batch") Cc: stable@vger.kernel.org Signed-off-by: Igor Mammedov <imammedo@redhat.com> Message-Id: <20200622160830.426022-1-imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-19Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU"Vitaly Kuznetsov
Guest crashes are observed on a Cascade Lake system when 'perf top' is launched on the host, e.g.

	BUG: unable to handle kernel paging request at fffffe0000073038
	PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120
	Oops: 0000 [#1] SMP PTI
	CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380
	...
	Call Trace:
	 serial8250_console_write+0xfe/0x1f0
	 call_console_drivers.constprop.0+0x9d/0x120
	 console_unlock+0x1ea/0x460

Call traces are different but the crash is imminent. The problem was blindly bisected to the commit 041bc42ce2d0 ("KVM: VMX: Micro-optimize vmexit time when not exposing PMU"). It was also confirmed that the issue goes away if PMU is exposed to the guest. With some instrumentation of the guest we can see what is being switched (when we do atomic_switch_perf_msrs()):

	vmx_vcpu_run: switching 2 msrs
	vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f
	vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2

The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame. Regardless of whether PMU is exposed to the guest or not, PEBS needs to be disabled upon switch. This reverts commit 041bc42ce2d0efac3b85bbb81dea8c74b81f4ef9. Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200619094046.654019-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-15  KVM: VMX: Add helpers to identify interrupt type from intr_info  (Sean Christopherson)
Add is_intr_type() and is_intr_type_n() to consolidate the boilerplate code for querying a specific type of interrupt given an encoded value from VMCS.VM_{ENTER,EXIT}_INTR_INFO, with and without an associated vector respectively. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200609014518.26756-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
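A hedged sketch of the helpers' likely shape (the INTR_INFO_* masks are real VMCS definitions in KVM; the exact implementation may differ):

	static inline bool is_intr_type(u32 intr_info, u32 type)
	{
		const u32 mask = INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK;

		return (intr_info & mask) == (INTR_INFO_VALID_MASK | type);
	}

	static inline bool is_intr_type_n(u32 intr_info, u32 type, u8 vector)
	{
		const u32 mask = INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK |
				 INTR_INFO_VECTOR_MASK;

		return (intr_info & mask) == (INTR_INFO_VALID_MASK | type | vector);
	}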
2020-06-15  kvm/svm: disable KCSAN for svm_vcpu_run()  (Qian Cai)
For some reason, running a simple qemu-kvm command with KCSAN will reset AMD hosts. It turns out svm_vcpu_run() could not be instrumented. Disable it for now.

	# /usr/libexec/qemu-kvm -name ubuntu-18.04-server-cloudimg -cpu host -smp 2 -m 2G -hda ubuntu-18.04-server-cloudimg.qcow2

	=== console output ===
	Kernel 5.6.0-next-20200408+ on an x86_64
	hp-dl385g10-05 login: <...host reset...>
	HPE ProLiant System BIOS A40 v1.20 (03/09/2018)
	(C) Copyright 1982-2018 Hewlett Packard Enterprise Development LP
	Early system initialization, please wait...

Signed-off-by: Qian Cai <cai@lca.pw> Message-Id: <20200415153709.1559-1-cai@lca.pw> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
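One way a single function can be excluded from KCSAN instrumentation is the __no_kcsan function attribute; a hedged sketch (whether the actual patch used exactly this mechanism is an assumption):

	static __no_kcsan void svm_vcpu_run(struct kvm_vcpu *vcpu)
	{
		/* ... world switch into and out of the guest, uninstrumented ... */
	}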
2020-06-15  KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery  (Vitaly Kuznetsov)
KVM now supports using an interrupt for 'page ready' APF event delivery, and the legacy mechanism has been deprecated. Switch KVM guests to the new one. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-9-vkuznets@redhat.com> [Use HYPERVISOR_CALLBACK_VECTOR instead of a separate vector. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-13  Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild  (Linus Torvalds)
Pull more Kbuild updates from Masahiro Yamada:

 - fix build rules in binderfs sample
 - fix build errors when Kbuild recurses to the top Makefile
 - convert '---help---' in Kconfig to 'help'

* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  treewide: replace '---help---' in Kconfig files with 'help'
  kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
  samples: binderfs: really compile this sample and fix build issues
2020-06-13  Merge tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 RAS updates from Thomas Gleixner: "RAS updates from Borislav Petkov:

 - Unmap a whole guest page if an MCE is encountered in it to avoid follow-on MCEs leading to the guest crashing, by Tony Luck. This change collided with the entry changes and the merge resolution would have been rather unpleasant. To avoid that the entry branch was merged in before applying this. The resulting code did not change over the rebase.

 - AMD MCE error thresholding machinery cleanup and hotplug sanitization, by Thomas Gleixner.

 - Change the MCE notifiers to denote whether they have handled the error and not break the chain early by returning NOTIFY_STOP, thus giving the opportunity for the later handlers in the chain to see it. By Tony Luck.

 - Add AMD family 0x17, models 0x60-6f support, by Alexander Monakov.

 - Last but not least, the usual round of fixes and improvements"

* tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy()
  x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
  EDAC/amd64: Add AMD family 17h model 60h PCI IDs
  hwmon: (k10temp) Add AMD family 17h model 60h PCI match
  x86/amd_nb: Add AMD family 17h model 60h PCI IDs
  x86/mcelog: Add compat_ioctl for 32-bit mcelog support
  x86/mce: Drop bogus comment about mce.kflags
  x86/mce: Fixup exception only for the correct MCEs
  EDAC: Drop the EDAC report status checks
  x86/mce: Add mce=print_all option
  x86/mce: Change default MCE logger to check mce->kflags
  x86/mce: Fix all mce notifiers to update the mce->kflags bitmask
  x86/mce: Add a struct mce.kflags field
  x86/mce: Convert the CEC to use the MCE notifier
  x86/mce: Rename "first" function as "early"
  x86/mce/amd, edac: Remove report_gart_errors
  x86/mce/amd: Make threshold bank setting hotplug robust
  x86/mce/amd: Cleanup threshold device remove path
  x86/mce/amd: Straighten CPU hotplug path
  x86/mce/amd: Sanitize thresholding device creation hotplug path
  ...
2020-06-13  Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework

This all started about 6 months ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches.

This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation.

Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions.

The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code paths enforceable and verifiable by tooling. A first batch of preparatory work hit mainline with commit d5f744f9a2ac ("Pull x86 entry code updates from Thomas Gleixner").

That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged.

The major changes coming with this are:

 - Preparatory cleanups

 - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them.

 - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS.

 - Move quite some of the low level ASM functionality into C code:

    - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistles in ASM.

    - exception entry/exit got the equivalent treatment

    - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue.

 - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now.

 - Remove the extra device interrupt entry maze and just use the regular exception entry code.

 - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM.

 - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY*() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach.

 - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch.

 - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery.

 - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector

 - About 680 lines of ASM maze are gone.

There are a few open issues:

 - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the probability to survive it is low, this was not high on the priority list.

 - Paravirtualization: When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straightforward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz.

 - KVM: KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks.

 - IDLE: Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list.

The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop.

With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Marco Elver, Paolo Bonzini, Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon"

* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
  x86/entry: Force rcu_irq_enter() when in idle task
  x86/entry: Make NMI use IDTENTRY_RAW
  x86/entry: Treat BUG/WARN as NMI-like entries
  x86/entry: Unbreak __irqentry_text_start/end magic
  x86/entry: __always_inline CR2 for noinstr
  lockdep: __always_inline more for noinstr
  x86/entry: Re-order #DB handler to avoid *SAN instrumentation
  x86/entry: __always_inline arch_atomic_* for noinstr
  x86/entry: __always_inline irqflags for noinstr
  x86/entry: __always_inline debugreg for noinstr
  x86/idt: Consolidate idt functionality
  x86/idt: Cleanup trap_init()
  x86/idt: Use proper constants for table size
  x86/idt: Add comments about early #PF handling
  x86/idt: Mark init only functions __init
  x86/entry: Rename trace_hardirqs_off_prepare()
  x86/entry: Clarify irq_{enter,exit}_rcu()
  x86/entry: Remove DBn stacks
  x86/entry: Remove debug IDT frobbing
  x86/entry: Optimize local_db_save() for virt
  ...
2020-06-14  treewide: replace '---help---' in Kconfig files with 'help'  (Masahiro Yamada)
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found:

	a) 4 spaces + '---help---'
	b) 7 spaces + '---help---'
	c) 8 spaces + '---help---'
	d) 1 space + 1 tab + '---help---'
	e) 1 tab + '---help---'  (correct indentation)
	f) 1 tab + 1 space + '---help---'
	g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the following command:

	$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-12  x86/entry: Force rcu_irq_enter() when in idle task  (Thomas Gleixner)
The idea of conditionally calling into rcu_irq_enter() only when RCU is not watching turned out to be not completely thought through. Paul noticed occasional premature end of grace periods in RCU torture testing. Bisection led to the commit which made the invocation of rcu_irq_enter() conditional on !rcu_is_watching(). It turned out that this conditional breaks RCU assumptions about the idle task when the scheduler tick happens to be a nested interrupt. Nested interrupts can happen when the first interrupt invokes softirq processing on return which enables interrupts. If that nested tick interrupt does not invoke rcu_irq_enter() then the RCU's irq-nesting checks will believe that this interrupt came directly from idle, which will cause RCU to report a quiescent state. Because this interrupt instead came from a softirq handler which might have been executing an RCU read-side critical section, this can cause the grace period to end prematurely. Change the condition from !rcu_is_watching() to is_idle_task(current) which enforces that interrupts in the idle task unconditionally invoke rcu_irq_enter() independent of the RCU state. This is also correct vs. user mode entries in NOHZ full scenarios because user mode entries bring RCU out of EQS and force the RCU irq nesting state accounting to nested. As only the first interrupt can enter from user mode a nested tick interrupt will enter from kernel mode and as the nesting state accounting is forced to nesting it will not do anything stupid even if rcu_irq_enter() has not been invoked. Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/87wo4cxubv.fsf@nanos.tec.linutronix.de
2020-06-12  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull more KVM updates from Paolo Bonzini: "The guest side of the asynchronous page fault work has been delayed to 5.9 in order to sync with Thomas's interrupt entry rework, but here's the rest of the KVM updates for this merge window.

  MIPS:
   - Loongson port

  PPC:
   - Fixes

  ARM:
   - Fixes

  x86:
   - KVM_SET_USER_MEMORY_REGION optimizations
   - Fixes
   - Selftest fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (62 commits)
  KVM: x86: do not pass poisoned hva to __kvm_set_memory_region
  KVM: selftests: fix sync_with_host() in smm_test
  KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected
  KVM: async_pf: Cleanup kvm_setup_async_pf()
  kvm: i8254: remove redundant assignment to pointer s
  KVM: x86: respect singlestep when emulating instruction
  KVM: selftests: Don't probe KVM_CAP_HYPERV_ENLIGHTENED_VMCS when nested VMX is unsupported
  KVM: selftests: do not substitute SVM/VMX check with KVM_CAP_NESTED_STATE check
  KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
  KVM: arm64: Move hyp_symbol_addr() to kvm_asm.h
  KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
  KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
  KVM: arm64: Remove host_cpu_context member from vcpu structure
  KVM: arm64: Stop sparse from moaning at __hyp_this_cpu_ptr
  KVM: arm64: Handle PtrAuth traps early
  KVM: x86: Unexport x86_fpu_cache and make it static
  KVM: selftests: Ignore KVM 5-level paging support for VM_MODE_PXXV48_4K
  KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
  KVM: arm64: Stop save/restoring ACTLR_EL1
  KVM: arm64: Add emulation for 32bit guests accessing ACTLR2
  ...
2020-06-12  x86/entry: Make NMI use IDTENTRY_RAW  (Thomas Gleixner)
For no reason other than beginning brainmelt, IDTENTRY_NMI was mapped to IDTENTRY_IST. This is not a problem on 64bit because the IST default entry point maps to IDTENTRY_RAW, which does not do any entry handling. The surplus function declaration for the noist C entry point is unused and as there is no ASM code emitted for NMI this went unnoticed. On 32bit IDTENTRY_IST maps to a regular IDTENTRY which does the normal entry handling. That is clearly the wrong thing to do for NMI. Map it to IDTENTRY_RAW to unbreak it. The IDTENTRY_NMI mapping needs to stay to avoid emitting ASM code. Fixes: 6271fef00b34 ("x86/entry: Convert NMI to IDTENTRY_NMI") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Debugged-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/CA+G9fYvF3cyrY+-iw_SZtpN-i2qA2BruHg4M=QYECU2-dNdsMw@mail.gmail.com
2020-06-12  x86/entry: Treat BUG/WARN as NMI-like entries  (Andy Lutomirski)
BUG/WARN are cleverly optimized using UD2 to handle the BUG/WARN out of line in an exception fixup. But if BUG or WARN is issued in a funny RCU context, then the idtentry_enter...() path might helpfully WARN that the RCU context is invalid, which results in infinite recursion. Split the BUG/WARN handling into an nmi_enter()/nmi_exit() path in exc_invalid_op() to increase the chance to survive the experience. [ tglx: Make the declaration match the implementation ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/f8fe40e0088749734b4435b554f73eee53dcf7a8.1591932307.git.luto@kernel.org
2020-06-11  Merge tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull the Kernel Concurrency Sanitizer from Thomas Gleixner: "The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector, which relies on compile-time instrumentation, and uses a watchpoint-based sampling approach to detect races. The feature was under development for quite some time and has already found legitimate bugs.

Unfortunately it comes with a limitation, which was only understood late in the development cycle: It requires an up to date CLANG-11 compiler. CLANG-11 is not yet released (scheduled for June), but it's the only compiler today which handles the kernel requirements and especially the annotations of functions to exclude them from KCSAN instrumentation correctly. These annotations really need to work so that low level entry code and especially int3 text poke handling can be completely isolated. A detailed discussion of the requirements and compiler issues can be found here: https://lore.kernel.org/lkml/CANpmjNMTsY_8241bS7=XAfqvZHFLrVEkv_uM4aDUWE_kh3Rvbw@mail.gmail.com/

We came to the conclusion that trying to work around compiler limitations and bugs again would end up in a major trainwreck, so requiring a working compiler seemed to be the best choice. For Continuous Integration purposes the compiler restriction is manageable and that's where most xxSAN reports come from. For a change this limitation might make GCC people actually look at their bugs. Some issues with CSAN in GCC are 7 years old and one has been 'fixed' 3 years ago with a half-baked solution which 'solved' the reported issue but not the underlying problem. The KCSAN developers also ponder using a GCC plugin to become independent, but that's not something which will show up in a few days. Blocking KCSAN until widespread compiler support is available is not a really good alternative because the continuous growth of lockless optimizations in the kernel demands proper tooling support"

* tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
  compiler_types.h, kasan: Use __SANITIZE_ADDRESS__ instead of CONFIG_KASAN to decide inlining
  compiler.h: Move function attributes to compiler_types.h
  compiler.h: Avoid nested statement expression in data_race()
  compiler.h: Remove data_race() and unnecessary checks from {READ,WRITE}_ONCE()
  kcsan: Update Documentation to change supported compilers
  kcsan: Remove 'noinline' from __no_kcsan_or_inline
  kcsan: Pass option tsan-instrument-read-before-write to Clang
  kcsan: Support distinguishing volatile accesses
  kcsan: Restrict supported compilers
  kcsan: Avoid inserting __tsan_func_entry/exit if possible
  ubsan, kcsan: Don't combine sanitizer with kcov on clang
  objtool, kcsan: Add kcsan_disable_current() and kcsan_enable_current_nowarn()
  kcsan: Add __kcsan_{enable,disable}_current() variants
  checkpatch: Warn about data_race() without comment
  kcsan: Use GFP_ATOMIC under spin lock
  Improve KCSAN documentation a bit
  kcsan: Make reporting aware of KCSAN tests
  kcsan: Fix function matching in report
  kcsan: Change data_race() to no longer require marking racing accesses
  kcsan: Move kcsan_{disable,enable}_current() to kcsan-checks.h
  ...
2020-06-11  Merge tag 'locking-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull atomics rework from Thomas Gleixner: "Peter Zijlstra's rework of atomics and fallbacks. This solves two problems:

 1) Compilers uninline small atomic_* static inline functions which can expose them to instrumentation.

 2) The instrumentation of atomic primitives was done at the architecture level while composites or fallbacks were provided at the generic level. As a result there are no uninstrumented variants of the fallbacks.

Both issues were in the way of fully isolating fragile entry code paths and especially the text poke int3 handler which is prone to an endless recursion problem when anything in that code path is about to be instrumented. This was always a problem, but got elevated due to the new batch mode updates of tracing. The solution is to mark the functions __always_inline and to flip the fallback and instrumentation so the non-instrumented variants are at the architecture level and the instrumentation is done in generic code. The latter introduces another fallback variant which will go away once all architectures have been moved over to arch_atomic_*"

* tag 'locking-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/atomics: Flip fallbacks and instrumentation
  asm-generic/atomic: Use __always_inline for fallback wrappers
2020-06-11  Merge tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull more x86 updates from Thomas Gleixner: "A set of fixes and updates for x86:

 - Unbreak paravirt VDSO clocks. While the VDSO code was moved into lib for sharing, a subtle check for the validity of paravirt clocks got replaced. While the replacement works perfectly fine for bare metal as the update of the VDSO clock mode is synchronous, it fails for paravirt clocks because the hypervisor can invalidate them asynchronously. Bring it back as an optional function so it does not inflict this on architectures which are free of PV damage.

 - Fix the jiffies to jiffies64 mapping on 64bit so it does not trigger an ODR violation on newer compilers

 - Three fixes for the SSBD and *IB* speculation mitigation maze to ensure consistency, not disabling of some *IB* variants wrongly and to prevent a rogue cross process shutdown of SSBD. All marked for stable.

 - Add yet more CPU models to the splitlock detection capable list !@#%$!

 - Bring the pr_info() back which tells that TSC deadline timer is enabled.

 - Reboot quirk for MacBook6,1"

* tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Unbreak paravirt VDSO clocks
  lib/vdso: Provide sanity check for cycles (again)
  clocksource: Remove obsolete ifdef
  x86_64: Fix jiffies ODR violation
  x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
  x86/speculation: Prevent rogue cross-process SSBD shutdown
  x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
  x86/cpu: Add Sapphire Rapids CPU model number
  x86/split_lock: Add Icelake microserver and Tigerlake CPU models
  x86/apic: Make TSC deadline timer detection message visible
  x86/reboot/quirks: Add MacBook6,1 reboot quirk
2020-06-11  Rebase locking/kcsan to locking/urgent  (Thomas Gleixner)
Merge the state of the locking kcsan branch before the read/write_once() and the atomics modifications got merged. Squash the fallout of the rebase on top of the read/write once and atomic fallback work into the merge. The history of the original branch is preserved in tag locking-kcsan-2020-06-02. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-11  KVM: x86: do not pass poisoned hva to __kvm_set_memory_region  (Paolo Bonzini)
__kvm_set_memory_region does not use the hva at all, so trying to catch use-after-delete is pointless and, worse, it fails access_ok now that we apply it to all memslots including private kernel ones. This fixes an AVIC regression. Fixes: 09d952c971a5 ("KVM: check userspace_addr for all memslots") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11  KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected  (Vitaly Kuznetsov)
'Page not present' event may or may not get injected depending on the guest's state. If the event wasn't injected, there is no need to inject the corresponding 'page ready' event as the guest may get confused. E.g. Linux thinks that the corresponding 'page not present' event wasn't delivered *yet* and allocates a 'dummy entry' for it. This entry is never freed. Note, 'wakeup all' events have no corresponding 'page not present' event and always get injected. s390 seems to always be able to inject 'page not present'; the change is effectively a nop. Suggested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200610175532.779793-2-vkuznets@redhat.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=208081 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11  kvm: i8254: remove redundant assignment to pointer s  (Colin Ian King)
The pointer s is being assigned a value that is never read; the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Message-Id: <20200609233121.1118683-1-colin.king@canonical.com> Fixes: 7837699fa6d7 ("KVM: In kernel PIT model") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11  KVM: x86: respect singlestep when emulating instruction  (Felipe Franciosi)
When userspace configures KVM_GUESTDBG_SINGLESTEP, KVM will manage the presence of X86_EFLAGS_TF via kvm_set/get_rflags on vcpus. The actual rflag bit is therefore hidden from callers. That includes init_emulate_ctxt(), which uses the value returned from kvm_get_rflags() to set ctxt->tf. As a result, x86_emulate_instruction() will skip a single step, leaving singlestep_rip stale and not returning to userspace. This resolves the issue by observing the vcpu guest_debug configuration alongside ctxt->tf in x86_emulate_instruction(), performing the single step if set. Cc: stable@vger.kernel.org Signed-off-by: Felipe Franciosi <felipe@nutanix.com> Message-Id: <20200519081048.8204-1-felipe@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
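A hedged sketch of the idea in x86_emulate_instruction() (condition shape only; the surrounding code is elided and may differ from the actual patch):

	/* Honour userspace single-stepping even when EFLAGS.TF is masked away. */
	if (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP))
		r = kvm_vcpu_do_singlestep(vcpu);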
2020-06-11  Merge branch 'kvm-basic-exit-reason' into HEAD  (Paolo Bonzini)
Using a topic branch so that stable branches can simply cherry-pick the patch. Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11  KVM: nVMX: Consult only the "basic" exit reason when routing nested exit  (Sean Christopherson)
Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON, when determining whether a nested VM-Exit should be reflected into L1 or handled by KVM in L0. For better or worse, the switch statement in nested_vmx_exit_reflected() currently defaults to "true", i.e. reflects any nested VM-Exit without dedicated logic. Because the case statements only contain the basic exit reason, any VM-Exit with modifier bits set will be reflected to L1, even if KVM intended to handle it in L0. Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY, i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to L1, as "failed VM-Entry" is the only modifier that KVM can currently encounter. The SMM modifiers will never be generated as KVM doesn't support/employ a SMI Transfer Monitor. Ditto for "exit from enclave", as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to enter an enclave in a KVM guest (L1 or L2). Fixes: 644d711aa0e1 ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit") Cc: Jim Mattson <jmattson@google.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200227174430.26371-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
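A hedged sketch of stripping the modifier bits before routing the exit (the VMCS accessor and exit-reason constant are real kernel names; the routing logic is elided):

	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
	u16 basic = exit_reason & 0xffff;	/* bits 15:0, modifier bits stripped */

	switch (basic) {
	case EXIT_REASON_MCE_DURING_VMENTRY:
		return false;	/* handle in L0, do not reflect to L1 */
	/* ... */
	}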
2020-06-11  x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy()  (Tony Luck)
The kbuild test robot reported this warning:

	arch/x86/kernel/cpu/mce/dev-mcelog.c: In function 'dev_mcelog_init_device':
	arch/x86/kernel/cpu/mce/dev-mcelog.c:346:2: warning: 'strncpy' output \
	truncated before terminating nul copying 12 bytes from a string of the \
	same length [-Wstringop-truncation]

This is accurate, but I don't care that the trailing NUL character isn't copied. The string being copied is just a magic number signature so that crash dump tools can be sure they are decoding the right blob of memory. Use memcpy() instead of strncpy(). Fixes: d8ecca4043f2 ("x86/mce/dev-mcelog: Dynamically allocate space for machine check records") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200527182808.27737-1-tony.luck@intel.com
2020-06-11  x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned  (Tony Luck)
An interesting thing happened when a guest Linux instance took a machine check. The VMM unmapped the bad page from guest physical space and passed the machine check to the guest. Linux took all the normal actions to offline the page from the process that was using it. But then guest Linux crashed because it said there was a second machine check inside the kernel with this stack trace:

	do_memory_failure
	set_mce_nospec
	set_memory_uc
	_set_memory_uc
	change_page_attr_set_clr
	cpa_flush
	clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine check (it isn't consuming the data). Further investigation showed that the VMM had passed in another machine check because it appeared that the guest was accessing the bad page. Fix is to check the scope of the poison by checking the MCi_MISC register. If the entire page is affected, then unmap the page. If only part of the page is affected, then mark the page as uncacheable. This assumes that VMMs will do the logical thing and pass in the "whole page scope" via the MCi_MISC register (since they unmapped the entire page). [ bp: Adjust to x86/entry changes. ] Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()") Reported-by: Jue Wang <juew@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
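A hedged sketch of the scope decision described above (the MCI_* macros exist in the kernel's MCE headers; whether the actual patch is structured exactly this way is an assumption):

	/* Is the whole page poisoned, per the address scope encoded in MCi_MISC? */
	static bool whole_page(struct mce *m)
	{
		if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
			return true;
		return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
	}

	/* ... unmap the page when whole_page(), otherwise only mark it uncacheable ... */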
2020-06-11  Merge branch 'x86/entry' into ras/core  (Thomas Gleixner)
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow up patches can be applied without creating a horrible merge conflict afterwards.
2020-06-11  x86/entry: Unbreak __irqentry_text_start/end magic  (Thomas Gleixner)
The entry rework moved interrupt entry code from the irqentry to the noinstr section which made the irqentry section empty. This breaks boundary checks which rely on the __irqentry_text_start/end markers to find out whether a function in a stack trace is interrupt/exception entry code. This affects the function graph tracer and filter_irq_stacks(). As the IDT entry points are all sequentially emitted this is rather simple to unbreak by injecting __irqentry_text_start/end as global labels. To make this work correctly:

 - Remove the IRQENTRY_TEXT section from the x86 linker script
 - Define __irqentry so it breaks the build if it's used
 - Adjust the entry mirroring in PTI
 - Remove the redundant kprobes and unwinder bound checks

Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-11  x86/entry: __always_inline CR2 for noinstr  (Peter Zijlstra)
vmlinux.o: warning: objtool: exc_page_fault()+0x9: call to read_cr2() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_page_fault()+0x24: call to prefetchw() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_page_fault()+0x21: call to kvm_handle_async_pf.isra.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_nmi()+0x1cc: call to write_cr2() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200603114052.243227806@infradead.org
2020-06-11  x86/entry: Re-order #DB handler to avoid *SAN instrumentation  (Peter Zijlstra)
vmlinux.o: warning: objtool: exc_debug()+0xbb: call to clear_ti_thread_flag.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: noist_exc_debug()+0x55: call to clear_ti_thread_flag.constprop.0() leaves .noinstr.text section

Rework things so that handle_debug() loses the noinstr and move the clear_thread_flag() into that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200603114052.127756554@infradead.org
2020-06-11  x86/entry: __always_inline arch_atomic_* for noinstr  (Peter Zijlstra)
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x33: call to arch_atomic_and.constprop.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200603114052.070166551@infradead.org
2020-06-11  x86/entry: __always_inline irqflags for noinstr  (Peter Zijlstra)
vmlinux.o: warning: objtool: lockdep_hardirqs_on()+0x65: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x5d: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x35: call to arch_local_irq_save() leaves .noinstr.text section
vmlinux.o: warning: objtool: check_preemption_disabled()+0x31: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: check_preemption_disabled()+0x33: call to arch_irqs_disabled_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x2f: call to native_irq_disable() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200603114052.012171668@infradead.org