summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2019-11-15x86/hyperv: Initialize clockevents earlier in CPU onliningMichael Kelley
Hyper-V has historically initialized stimer-based clockevents late in the process of onlining a CPU because clockevents depend on stimer interrupts. In the original Hyper-V design, stimer interrupts generate a VMbus message, so the VMbus machinery must be running first, and VMbus can't be initialized until relatively late. On x86/64, LAPIC timer based clockevents are used during early initialization before VMbus and stimer-based clockevents are ready, and again during CPU offlining after the stimer clockevents have been shut down. Unfortunately, this design creates problems when offlining CPUs for hibernation or other purposes. stimer-based clockevents are shut down relatively early in the offlining process, so clockevents_unbind_device() must be used to fallback to the LAPIC-based clockevents for the remainder of the offlining process. Furthermore, the late initialization and early shutdown of stimer-based clockevents doesn't work well on ARM64 since there is no other timer like the LAPIC to fallback to. So CPU onlining and offlining doesn't work properly. Fix this by recognizing that stimer Direct Mode is the normal path for newer versions of Hyper-V on x86/64, and the only path on other architectures. With stimer Direct Mode, stimer interrupts don't require any VMbus machinery. stimer clockevents can be initialized and shut down consistent with how it is done for other clockevent devices. While the old VMbus-based stimer interrupts must still be supported for backward compatibility on x86, that mode of operation can be treated as legacy. So add a new Hyper-V stimer entry in the CPU hotplug state list, and use that new state when in Direct Mode. Update the Hyper-V clocksource driver to allocate and initialize stimer clockevents earlier during boot. Update Hyper-V initialization and the VMbus driver to use this new design. As a result, the LAPIC timer is no longer used during boot or CPU onlining/offlining and clockevents_unbind_device() is not called. But retain the old design as a legacy implementation for older versions of Hyper-V that don't support Direct Mode. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Link: https://lkml.kernel.org/r/1573607467-9456-1-git-send-email-mikelley@microsoft.com
2019-11-15Merge branch 'linus' into x86/hypervThomas Gleixner
Pick up upstream fixes to avoid conflicts.
2019-11-14x86/crash: Align function arguments on opening bracesBorislav Petkov
... or let function calls stick out and thus remain on a single line, even if the 80 cols rule is violated by a couple of chars, for better readability. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20191114172200.19563-1-bp@alien8.de
2019-11-14x86/kdump: Remove the backup region handlingLianbo Jiang
When the crashkernel kernel command line option is specified, the low 1M memory will always be reserved now. Therefore, it's not necessary to create a backup region anymore and also no need to copy the contents of the first 640k to it. Remove all the code related to handling that backup region. [ bp: Massage commit message. ] Signed-off-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: bhe@redhat.com Cc: Dave Young <dyoung@redhat.com> Cc: d.hatayama@fujitsu.com Cc: dhowells@redhat.com Cc: ebiederm@xmission.com Cc: horms@verge.net.au Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jürgen Gross <jgross@suse.com> Cc: kexec@lists.infradead.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: vgoyal@redhat.com Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191108090027.11082-3-lijiang@redhat.com
2019-11-14KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()Sean Christopherson
Acquire the per-VM slots_lock when zapping all shadow pages as part of toggling nx_huge_pages. The fast zap algorithm relies on exclusivity (via slots_lock) to identify obsolete vs. valid shadow pages, because it uses a single bit for its generation number. Holding slots_lock also obviates the need to acquire a read lock on the VM's srcu. Failing to take slots_lock when toggling nx_huge_pages allows multiple instances of kvm_mmu_zap_all_fast() to run concurrently, as the other user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock. (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough to enforce exclusivity). Concurrent fast zap instances causes obsolete shadow pages to be incorrectly identified as valid due to the single bit generation number wrapping, which results in stale shadow pages being left in KVM's MMU and leads to all sorts of undesirable behavior. The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and toggling nx_huge_pages via its module param. Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used an ulong-sized generation instead of relying on exclusivity for correctness, but all callers except the recently added set_nx_huge_pages() needed to hold slots_lock anyways. Therefore, this patch does not have to be backported to stable kernels. Given that toggling nx_huge_pages is by no means a fast path, force it to conform to the current approach instead of reintroducing the previous generation count. Fixes: b8e8c8303ff28 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE) Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-14x86/kdump: Always reserve the low 1M when the crashkernel option is specifiedLianbo Jiang
On x86, purgatory() copies the first 640K of memory to a backup region because the kernel needs those first 640K for the real mode trampoline during boot, among others. However, when SME is enabled, the kernel cannot properly copy the old memory to the backup area but reads only its encrypted contents. The result is that the crash tool gets invalid pointers when parsing vmcore: crash> kmem -s|grep -i invalid kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4 kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4 crash> So reserve the remaining low 1M memory when the crashkernel option is specified (after reserving real mode memory) so that allocated memory does not fall into the low 1M area and thus the copying of the contents of the first 640k to a backup region in purgatory() can be avoided altogether. This way, it does not need to be included in crash dumps or used for anything except the trampolines that must live in the low 1M. [ bp: Heavily rewrite commit message, flip check logic in crash_reserve_low_1M().] Signed-off-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: bhe@redhat.com Cc: Dave Young <dyoung@redhat.com> Cc: d.hatayama@fujitsu.com Cc: dhowells@redhat.com Cc: ebiederm@xmission.com Cc: horms@verge.net.au Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jürgen Gross <jgross@suse.com> Cc: kexec@lists.infradead.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: vgoyal@redhat.com Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191108090027.11082-2-lijiang@redhat.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=204793
2019-11-14x86/crash: Add a forward declaration of struct kimageLianbo Jiang
Add a forward declaration of struct kimage to the crash.h header because future changes will invoke a crash-specific function from the realmode init path and the compiler will complain otherwise like this: In file included from arch/x86/realmode/init.c:11: ./arch/x86/include/asm/crash.h:5:32: warning: ‘struct kimage’ declared inside\ parameter list will not be visible outside of this definition or declaration 5 | int crash_load_segments(struct kimage *image); | ^~~~~~ ./arch/x86/include/asm/crash.h:6:37: warning: ‘struct kimage’ declared inside\ parameter list will not be visible outside of this definition or declaration 6 | int crash_copy_backup_region(struct kimage *image); | ^~~~~~ ./arch/x86/include/asm/crash.h:7:39: warning: ‘struct kimage’ declared inside\ parameter list will not be visible outside of this definition or declaration 7 | int crash_setup_memmap_entries(struct kimage *image, | [ bp: Rewrite the commit message. ] Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: bhe@redhat.com Cc: d.hatayama@fujitsu.com Cc: dhowells@redhat.com Cc: dyoung@redhat.com Cc: ebiederm@xmission.com Cc: horms@verge.net.au Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jürgen Gross <jgross@suse.com> Cc: kexec@lists.infradead.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: vgoyal@redhat.com Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191108090027.11082-4-lijiang@redhat.com Link: https://lkml.kernel.org/r/201910310233.EJRtTMWP%25lkp@intel.com
2019-11-14xen/mcelog: add PPIN to record when availableJan Beulich
This is to augment commit 3f5a7896a5 ("x86/mce: Include the PPIN in MCE records when available"). I'm also adding "synd" and "ipid" fields to struct xen_mce, in an attempt to keep field offsets in sync with struct mce. These two fields won't get populated for now, though. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-11-13KVM: X86: Reset the three MSR list number variables to 0 in kvm_init_msr_list()Xiaoyao Li
When applying commit 7a5ee6edb42e ("KVM: X86: Fix initialization of MSR lists"), it forgot to reset the three MSR lists number varialbes to 0 while removing the useless conditionals. Fixes: 7a5ee6edb42e (KVM: X86: Fix initialization of MSR lists) Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13kvm: x86: disable shattered huge page recovery for PREEMPT_RT.Paolo Bonzini
If a huge page is recovered (and becomes no executable) while another thread is executing it, the resulting contention on mmu_lock can cause latency spikes. Disabling recovery for PREEMPT_RT kernels fixes this issue. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13ftrace/x86: Tell objtool to ignore nondeterministic ftrace stack layoutJosh Poimboeuf
Objtool complains about the new ftrace direct trampoline code: arch/x86/kernel/ftrace_64.o: warning: objtool: ftrace_regs_caller()+0x190: stack state mismatch: cfa1=7+16 cfa2=7+24 Typically, code has a deterministic stack layout, such that at a given instruction address, the stack frame size is always the same. That's not the case for the new ftrace_regs_caller() code after it adjusts the stack for the direct case. Just plead ignorance and assume it's always the non-direct path. Note this creates a tiny window for ORC to get confused. Link: http://lkml.kernel.org/r/20191108225100.ea3bhsbdf6oerj6g@treble Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13ftrace/x86: Add a counter to test function_graph with directSteven Rostedt (VMware)
As testing for direct calls from the function graph tracer adds a little overhead (which is a lot when tracing every function), add a counter that can be used to test if function_graph tracer needs to test for a direct caller or not. It would have been nicer if we could use a static branch, but the static branch logic fails when used within the function graph tracer trampoline. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13ftrace/x86: Add register_ftrace_direct() for custom trampolinesSteven Rostedt (VMware)
Enable x86 to allow for register_ftrace_direct(), where a custom trampoline may be called directly from an ftrace mcount/fentry location. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13x86/resctrl: Fix potential lockdep warningXiaochen Shen
rdtgroup_cpus_write() and mkdir_rdt_prepare() call rdtgroup_kn_lock_live() -> kernfs_to_rdtgroup() to get 'rdtgrp', and then call the rdt_last_cmd_{clear,puts,...}() functions which will check if rdtgroup_mutex is held/requires its caller to hold rdtgroup_mutex. But if 'rdtgrp' returned from kernfs_to_rdtgroup() is NULL, rdtgroup_mutex is not held and calling rdt_last_cmd_{clear,puts,...}() will result in a self-incurred, potential lockdep warning. Remove the rdt_last_cmd_{clear,puts,...}() calls in these two paths. Just returning error should be sufficient to report to the user that the entry doesn't exist any more. [ bp: Massage. ] Fixes: 94457b36e8a5 ("x86/intel_rdt: Add diagnostics when writing the cpus file") Fixes: cfd0f34e4cd5 ("x86/intel_rdt: Add diagnostics when making directories") Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Fenghua Yu <fenghua.yu@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: pei.p.jia@intel.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1573079796-11713-1-git-send-email-xiaochen.shen@intel.com
2019-11-13perf/x86/intel/pt: Prevent redundant WRMSRsAlexander Shishkin
With recent optimizations to AUX and PT buffer management code (high order AUX allocations, opportunistic Single Range Output), it is far more likely now that the output MSRs won't need reprogramming on every sched-in. To avoid needless WRMSRs of those registers, cache their values and only write them when needed. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191105082701.78442-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/x86/intel/pt: Opportunistically use single range output modeAlexander Shishkin
Most of PT implementations support Single Range Output mode, which is an alternative to ToPA that can be used for a single contiguous buffer and if we don't require an interrupt, that is, in AUX snapshot mode. Now that perf core will use high order allocations for the AUX buffer, in many cases the first condition will also be satisfied. The two most obvious benefits of the Single Range Output mode over the ToPA are: * not having to allocate the ToPA table(s), * not using the ToPA walk hardware. Make use of this functionality where available and appropriate. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191105082701.78442-2-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/x86/intel/pt: Add sampling supportAlexander Shishkin
Add AUX sampling support to the PT PMU: implement an NMI-safe callback that takes a snapshot of the buffer without touching the event states. This is done for PT events that don't use PMIs, that is, snapshot mode (RO mapping of the AUX area). Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Cc: mathieu.poirier@linaro.org Link: https://lkml.kernel.org/r/20191025140835.53665-4-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13perf/x86/intel/pt: Factor out pt_config_start()Alexander Shishkin
PT trace is now enabled at the bottom of the event configuration function that takes care of all configuration bits related to a given event, including the address filter update. This is only needed where the event configuration changes, that is, in ->add()/->start(). In the interrupt path we can use a lighter version that keeps the configuration intact, since it hasn't changed, and only flips the enable bit. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Cc: mathieu.poirier@linaro.org Link: https://lkml.kernel.org/r/20191025140835.53665-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts, DAX/ZONE_DEVICE, and module unload/reload" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved KVM: VMX: Introduce pi_is_pir_empty() helper KVM: VMX: Do not change PID.NDST when loading a blocked vCPU KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON KVM: X86: Fix initialization of MSR lists KVM: fix placement of refcount initialization KVM: Fix NULL-ptr deref after kvm_create_vm fails
2019-11-12Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TSX Async Abort and iTLB Multihit mitigations from Thomas Gleixner: "The performance deterioration departement is not proud at all of presenting the seventh installment of speculation mitigations and hardware misfeature workarounds: 1) TSX Async Abort (TAA) - 'The Annoying Affair' TAA is a hardware vulnerability that allows unprivileged speculative access to data which is available in various CPU internal buffers by using asynchronous aborts within an Intel TSX transactional region. The mitigation depends on a microcode update providing a new MSR which allows to disable TSX in the CPU. CPUs which have no microcode update can be mitigated by disabling TSX in the BIOS if the BIOS provides a tunable. Newer CPUs will have a bit set which indicates that the CPU is not vulnerable, but the MSR to disable TSX will be available nevertheless as it is an architected MSR. That means the kernel provides the ability to disable TSX on the kernel command line, which is useful as TSX is a truly useful mechanism to accelerate side channel attacks of all sorts. 2) iITLB Multihit (NX) - 'No eXcuses' iTLB Multihit is an erratum where some Intel processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch hits multiple entries in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. A malicious guest running on a virtualized system can exploit this erratum to perform a denial of service attack. The workaround is that KVM marks huge pages in the extended page tables as not executable (NX). If the guest attempts to execute in such a page, the page is broken down into 4k pages which are marked executable. The workaround comes with a mechanism to recover these shattered huge pages over time. Both issues come with full documentation in the hardware vulnerabilities section of the Linux kernel user's and administrator's guide. Thanks to all patch authors and reviewers who had the extraordinary priviledge to be exposed to this nuisance. Special thanks to Borislav Petkov for polishing the final TAA patch set and to Paolo Bonzini for shepherding the KVM iTLB workarounds and providing also the backports to stable kernels for those!" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs Documentation: Add ITLB_MULTIHIT documentation kvm: x86: mmu: Recovery of shattered NX large pages kvm: Add helper function for creating VM worker threads kvm: mmu: ITLB_MULTIHIT mitigation cpu/speculation: Uninline and export CPU mitigations helpers x86/cpu: Add Tremont to the cpu vulnerability whitelist x86/bugs: Add ITLB_MULTIHIT bug infrastructure x86/tsx: Add config options to set tsx=on|off|auto x86/speculation/taa: Add documentation for TSX Async Abort x86/tsx: Add "auto" option to the tsx= cmdline parameter kvm/x86: Export MDS_NO=0 to guests when TSX is enabled x86/speculation/taa: Add sysfs reporting for TSX Async Abort x86/speculation/taa: Add mitigation for TSX Async Abort x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default x86/cpu: Add a helper function x86_read_arch_cap_msr() x86/msr: Add the IA32_TSX_CTRL MSR
2019-11-12Merge branches 'iommu/fixes', 'arm/qcom', 'arm/renesas', 'arm/rockchip', ↵Joerg Roedel
'arm/mediatek', 'arm/tegra', 'arm/smmu', 'x86/amd', 'x86/vt-d', 'virtio' and 'core' into next
2019-11-12x86/boot: Introduce setup_indirectDaniel Kiper
The setup_data is a bit awkward to use for extremely large data objects, both because the setup_data header has to be adjacent to the data object and because it has a 32-bit length field. However, it is important that intermediate stages of the boot process have a way to identify which chunks of memory are occupied by kernel data. Thus introduce an uniform way to specify such indirect data as setup_indirect struct and SETUP_INDIRECT type. And finally bump setup_header version in arch/x86/boot/header.S. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ross Philipson <ross.philipson@oracle.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: ard.biesheuvel@linaro.org Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: dave.hansen@linux.intel.com Cc: eric.snowberg@oracle.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: kanth.ghatraju@oracle.com Cc: linux-doc@vger.kernel.org Cc: linux-efi <linux-efi@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: rdunlap@infradead.org Cc: ross.philipson@oracle.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20191112134640.16035-4-daniel.kiper@oracle.com
2019-11-12x86/boot: Introduce kernel_info.setup_type_maxDaniel Kiper
This field contains maximal allowed type for setup_data. Do not bump setup_header version in arch/x86/boot/header.S because it will be followed by additional changes coming into the Linux/x86 boot protocol. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Philipson <ross.philipson@oracle.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: ard.biesheuvel@linaro.org Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: dave.hansen@linux.intel.com Cc: eric.snowberg@oracle.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: kanth.ghatraju@oracle.com Cc: linux-doc@vger.kernel.org Cc: linux-efi <linux-efi@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: rdunlap@infradead.org Cc: ross.philipson@oracle.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20191112134640.16035-3-daniel.kiper@oracle.com
2019-11-12x86/boot: Introduce kernel_infoDaniel Kiper
The relationships between the headers are analogous to the various data sections: setup_header = .data boot_params/setup_data = .bss What is missing from the above list? That's right: kernel_info = .rodata We have been (ab)using .data for things that could go into .rodata or .bss for a long time, for lack of alternatives and -- especially early on -- inertia. Also, the BIOS stub is responsible for creating boot_params, so it isn't available to a BIOS-based loader (setup_data is, though). setup_header is permanently limited to 144 bytes due to the reach of the 2-byte jump field, which doubles as a length field for the structure, combined with the size of the "hole" in struct boot_params that a protected-mode loader or the BIOS stub has to copy it into. It is currently 119 bytes long, which leaves us with 25 very precious bytes. This isn't something that can be fixed without revising the boot protocol entirely, breaking backwards compatibility. boot_params proper is limited to 4096 bytes, but can be arbitrarily extended by adding setup_data entries. It cannot be used to communicate properties of the kernel image, because it is .bss and has no image-provided content. kernel_info solves this by providing an extensible place for information about the kernel image. It is readonly, because the kernel cannot rely on a bootloader copying its contents anywhere, but that is OK; if it becomes necessary it can still contain data items that an enabled bootloader would be expected to copy into a setup_data chunk. Do not bump setup_header version in arch/x86/boot/header.S because it will be followed by additional changes coming into the Linux/x86 boot protocol. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Philipson <ross.philipson@oracle.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: ard.biesheuvel@linaro.org Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: dave.hansen@linux.intel.com Cc: eric.snowberg@oracle.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: kanth.ghatraju@oracle.com Cc: linux-doc@vger.kernel.org Cc: linux-efi <linux-efi@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: rdunlap@infradead.org Cc: ross.philipson@oracle.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20191112134640.16035-2-daniel.kiper@oracle.com
2019-11-12x86/mce/therm_throt: Optimize notifications of thermal throttleSrinivas Pandruvada
Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). The CPU hardware will react by limiting power/frequency and using duty cycles to bring the temperature back into normal range. Thus users may see a "critical" message about the "temperature above threshold" which is soon followed by "temperature/speed normal". These messages are rate-limited, but still may repeat every few minutes. This issue became worse starting with the Ivy Bridge generation of CPUs because they include a TCC activation offset in the MSR IA32_TEMPERATURE_TARGET. OEMs use this to provide alerts long before critical temperatures are reached. A test run on a laptop with Intel 8th Gen i5 core for two hours with a workload resulted in 20K+ thermal interrupts per CPU for core level and another 20K+ interrupts at package level. The kernel logs were full of throttling messages. The real value of these threshold interrupts, is to debug problems with the external cooling solutions and performance issues due to excessive throttling. So the solution here is the following: - In the current thermal_throttle folder, show: - the maximum time for one throttling event and, - the total amount of time the system was in throttling state. - Do not log short excursions. - Log only when, in spite of thermal throttling, the temperature is rising. On the high threshold interrupt trigger a delayed workqueue that monitors the threshold violation log bit (THERM_STATUS_PROCHOT_LOG). When the log bit is set, this workqueue callback calculates three point moving average and logs a warning message when the temperature trend is rising. When this log bit is clear and temperature is below threshold temperature, then the workqueue callback logs a "Normal" message. Once a high threshold event is logged, the logging is rate-limited. With this patch on the same test laptop, no warnings are printed in the logs as the max time the processor could bring the temperature under control is only 280 ms. This implementation is done with the inputs from Alan Cox and Tony Luck. [ bp: Touchups. ] Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: bberg@redhat.com Cc: ckellner@redhat.com Cc: hdegoede@redhat.com Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191111214312.81365-1-srinivas.pandruvada@linux.intel.com
2019-11-12x86/quirks: Disable HPET on Intel Coffe Lake platformsKai-Heng Feng
Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered PC10, which in consequence marks TSC as unstable because HPET is used as watchdog clocksource for TSC. Harry Pan tried to work around it in the clocksource watchdog code [1] thereby creating a circular dependency between HPET and TSC. This also ignores the fact, that HPET is not only unsuitable as watchdog clocksource on these systems, it becomes unusable in general. Disable HPET on affected platforms. Suggested-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203183 Link: https://lore.kernel.org/lkml/20190516090651.1396-1-harry.pan@intel.com/ [1] Link: https://lkml.kernel.org/r/20191016103816.30650-1-kai.heng.feng@canonical.com
2019-11-12x86/init: Allow DT configured systems to disable RTC at boot timeRahul Tanwar
Systems which do not support RTC run into boot problems as the kernel assumes the availability of the RTC by default. On device tree configured systems the availability of the RTC can be detected by querying the corresponding device tree node. Implement a wallclock init function to query the device tree and disable RTC if the RTC is marked as not available in the corresponding node. [ tglx: Rewrote changelog and comments. Added proper __init(const) annotations. ] Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Signed-off-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/b84d9152ce0c1c09896ff4987e691a0715cb02df.1570693058.git.rahul.tanwar@linux.intel.com
2019-11-12x86/hyperv: Allow guests to enable InvariantTSCAndrea Parri
If the hardware supports TSC scaling, Hyper-V will set bit 15 of the HV_PARTITION_PRIVILEGE_MASK in guest VMs with a compatible Hyper-V configuration version. Bit 15 corresponds to the AccessTscInvariantControls privilege. If this privilege bit is set, guests can access the HvSyntheticInvariantTscControl MSR: guests can set bit 0 of this synthetic MSR to enable the InvariantTSC feature. After setting the synthetic MSR, CPUID will enumerate support for InvariantTSC. Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lkml.kernel.org/r/20191003155200.22022-1-parri.andrea@gmail.com
2019-11-12x86/hyperv: Micro-optimize send_ipi_one()Vitaly Kuznetsov
When sending an IPI to a single CPU there is no need to deal with cpumasks. With 2 CPU guest on WS2019 a minor (like 3%, 8043 -> 7761 CPU cycles) improvement with smp_call_function_single() loop benchmark can be seeb. The optimization, however, is tiny and straitforward. Also, send_ipi_one() is important for PV spinlock kick. Switching to the regular APIC IPI send for CPU > 64 case does not make sense as it is twice as expesive (12650 CPU cycles for __send_ipi_mask_ex() call, 26000 for orig_apic.send_IPI(cpu, vector)). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Link: https://lkml.kernel.org/r/20191027151938.7296-1-vkuznets@redhat.com
2019-11-12KVM: MMU: Do not treat ZONE_DEVICE pages as being reservedSean Christopherson
Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and instead manually handle ZONE_DEVICE on a case-by-case basis. For things like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal pages, e.g. put pages grabbed via gup(). But for flows such as setting A/D bits or shifting refcounts for transparent huge pages, KVM needs to to avoid processing ZONE_DEVICE pages as the flows in question lack the underlying machinery for proper handling of ZONE_DEVICE pages. This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup() when running a KVM guest backed with /dev/dax memory, as KVM straight up doesn't put any references to ZONE_DEVICE pages acquired by gup(). Note, Dan Williams proposed an alternative solution of doing put_page() on ZONE_DEVICE pages immediately after gup() in order to simplify the auditing needed to ensure is_zone_device_page() is called if and only if the backing device is pinned (via gup()). But that approach would break kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til unmap() when accessing guest memory, unlike KVM's secondary MMU, which coordinates with mmu_notifier invalidations to avoid creating stale page references, i.e. doesn't rely on pages being pinned. [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl Reported-by: Adam Borowski <kilobyte@angband.pl> Analyzed-by: David Hildenbrand <david@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Introduce pi_is_pir_empty() helperJoao Martins
Streamline the PID.PIR check and change its call sites to use the newly added helper. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Do not change PID.NDST when loading a blocked vCPUJoao Martins
When vCPU enters block phase, pi_pre_block() inserts vCPU to a per pCPU linked list of all vCPUs that are blocked on this pCPU. Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR which its handler (wakeup_handler()) is responsible to kick (unblock) any vCPU on that linked list that now has pending posted interrupts. While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be scheduled to run on a different pCPU, vmx_vcpu_pi_load() will clear PID.SN but will also *overwrite PID.NDST to this different pCPU*. Instead of keeping it with original pCPU which vCPU had entered block phase on. This results in an issue because when a posted interrupt is delivered, as the wakeup_handler() will be executed and fail to find blocked vCPU on its per pCPU linked list of all vCPUs that are blocked on this pCPU. Which is due to the vCPU being placed on a *different* per pCPU linked list i.e. the original pCPU in which it entered block phase. The regression is introduced by commit c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN"). Therefore, partially revert it and reintroduce the condition in vmx_vcpu_pi_load() responsible for avoiding changing PID.NDST when loading a blocked vCPU. Fixes: c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN") Tested-by: Nathan Ni <nathan.ni@oracle.com> Co-developed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Consider PID.PIR to determine if vCPU has pending interruptsJoao Martins
Commit 17e433b54393 ("KVM: Fix leak vCPU's VMCS value into other pCPU") introduced vmx_dy_apicv_has_pending_interrupt() in order to determine if a vCPU have a pending posted interrupt. This routine is used by kvm_vcpu_on_spin() when searching for a a new runnable vCPU to schedule on pCPU instead of a vCPU doing busy loop. vmx_dy_apicv_has_pending_interrupt() determines if a vCPU has a pending posted interrupt solely based on PID.ON. However, when a vCPU is preempted, vmx_vcpu_pi_put() sets PID.SN which cause raised posted interrupts to only set bit in PID.PIR without setting PID.ON (and without sending notification vector), as depicted in VT-d manual section 5.2.3 "Interrupt-Posting Hardware Operation". Therefore, checking PID.ON is insufficient to determine if a vCPU has pending posted interrupts and instead we should also check if there is some bit set on PID.PIR if PID.SN=1. Fixes: 17e433b54393 ("KVM: Fix leak vCPU's VMCS value into other pCPU") Reviewed-by: Jagannathan Raman <jag.raman@oracle.com> Co-developed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Fix comment to specify PID.ON instead of PIR.ONLiran Alon
The Outstanding Notification (ON) bit is part of the Posted Interrupt Descriptor (PID) as opposed to the Posted Interrupts Register (PIR). The latter is a bitmap for pending vectors. Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: X86: Fix initialization of MSR listsChenyi Qiang
The three MSR lists(msrs_to_save[], emulated_msrs[] and msr_based_features[]) are global arrays of kvm.ko, which are adjusted (copy supported MSRs forward to override the unsupported MSRs) when insmod kvm-{intel,amd}.ko, but it doesn't reset these three arrays to their initial value when rmmod kvm-{intel,amd}.ko. Thus, at the next installation, kvm-{intel,amd}.ko will do operations on the modified arrays with some MSRs lost and some MSRs duplicated. So define three constant arrays to hold the initial MSR lists and initialize msrs_to_save[], emulated_msrs[] and msr_based_features[] based on the constant arrays. Cc: stable@vger.kernel.org Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> [Remove now useless conditionals. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-11arch: rely on asm-generic/io.h for default ioremap_* definitionsChristoph Hellwig
Various architectures that use asm-generic/io.h still defined their own default versions of ioremap_nocache, ioremap_wt and ioremap_wc that point back to plain ioremap directly or indirectly. Remove these definitions and rely on asm-generic/io.h instead. For this to work the backup ioremap_* defintions needs to be changed to purely cpp macros instea of inlines to cover for architectures like openrisc that only define ioremap after including <asm-generic/io.h>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Palmer Dabbelt <palmer@dabbelt.com>
2019-11-11x86: Clean up ioremap()Christoph Hellwig
Use ioremap() as the main implemented function, and defines ioremap_nocache() as a deprecated alias of ioremap() in preparation of removing ioremap_nocache() entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2019-11-11iommu/vt-d: Check VT-d RMRR region in BIOS is reported as reservedYian Chen
VT-d RMRR (Reserved Memory Region Reporting) regions are reserved for device use only and should not be part of allocable memory pool of OS. BIOS e820_table reports complete memory map to OS, including OS usable memory ranges and BIOS reserved memory ranges etc. x86 BIOS may not be trusted to include RMRR regions as reserved type of memory in its e820 memory map, hence validate every RMRR entry with the e820 memory map to make sure the RMRR regions will not be used by OS for any other purposes. ia64 EFI is working fine so implement RMRR validation as a dummy function Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Yian Chen <yian.chen@intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-11-11KVM: x86: get rid of odd out jump label in pdptrs_changedMiaohe Lin
The odd out jump label is really not needed. Get rid of it by return true directly while r < 0 as suggested by Paolo. This further lead to var changed being unused. Remove it too. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-11x86/PCI: sta2x11: use default DMA address translationNicolas Saenz Julienne
The devices found behind this PCIe chip have unusual DMA mapping constraints as there is an AMBA interconnect placed in between them and the different PCI endpoints. The offset between physical memory addresses and AMBA's view is provided by reading a PCI config register, which is saved and used whenever DMA mapping is needed. It turns out that this DMA setup can be represented by properly setting 'dma_pfn_offset', 'dma_bus_mask' and 'dma_mask' during the PCI device enable fixup. And ultimately allows us to get rid of this device's custom DMA functions. Aside from the code deletion and DMA setup, sta2x11_pdev_to_mapping() is moved to avoid warnings whenever CONFIG_PM is not enabled. Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-11x86: olpc-xo1-sci: Remove invocation of MFD's .enable()/.disable() call-backsLee Jones
IO regions are now requested and released by this device's parent. Signed-off-by: Lee Jones <lee.jones@linaro.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-11-11x86: olpc-xo1-pm: Remove invocation of MFD's .enable()/.disable() call-backsLee Jones
IO regions are now requested and released by this device's parent. Signed-off-by: Lee Jones <lee.jones@linaro.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-11-11Merge tag 'v5.4-rc7' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11perf/x86/amd: Remove set but not used variable 'active'Zheng Yongjun
'-Wunused-but-set-variable' triggers this warning: arch/x86/events/amd/core.c: In function amd_pmu_handle_irq: arch/x86/events/amd/core.c:656:6: warning: variable active set but not used [-Wunused-but-set-variable] GCC is right, 'active' is not used anymore. This variable was introduced earlier this year and then removed in: df4d29732fdad perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp [ mingo: Improved the changelog, fixed build warning caused by this fix, improved surrounding code. ] Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Cc: <acme@kernel.org> Cc: <alexander.shishkin@linux.intel.com> Cc: <mark.rutland@arm.com> Cc: <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191110094453.113001-1-zhengyongjun3@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11Merge tag 'v5.4-rc7' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-10Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of fixes for x86: - Make the tsc=reliable/nowatchdog command line parameter work again. It was broken with the introduction of the early TSC clocksource. - Prevent the evaluation of exception stacks before they are set up. This causes a crash in dumpstack because the stack walk termination gets screwed up. - Prevent a NULL pointer dereference in the rescource control file system. - Avoid bogus warnings about APIC id mismatch related to the LDR which can happen when the LDR is not in use and therefore not initialized. Only evaluate that when the APIC is in logical destination mode" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tsc: Respect tsc command line paraemeter for clocksource_tsc_early x86/dumpstack/64: Don't evaluate exception stacks before setup x86/apic/32: Avoid bogus LDR warnings x86/resctrl: Prevent NULL pointer dereference when reading mondata
2019-11-07x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUsJosh Poimboeuf
For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of cpu_bugs_smt_update() causes the function to return early, unintentionally skipping the MDS and TAA logic. This is not a problem for MDS, because there appears to be no overlap between IBRS_ALL and MDS-affected CPUs. So the MDS mitigation would be disabled and nothing would need to be done in this function anyway. But for TAA, the TAA_MSG_SMT string will never get printed on Cascade Lake and newer. The check is superfluous anyway: when 'spectre_v2_enabled' is SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement handles it appropriately by doing nothing. So just remove the check. Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Borislav Petkov <bp@suse.de>
2019-11-07x86/stacktrace: update kconfig help text for reliable unwindersJoe Lawrence
commit 6415b38bae26 ("x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinder") added the ORC unwinder as a "reliable" unwinder. Update the help text to reflect that change: the frame pointer unwinder is no longer the only one that can provide HAVE_RELIABLE_STACKTRACE. Link: http://lkml.kernel.org/r/20191107032958.14034-1-joe.lawrence@redhat.com To: linux-kernel@vger.kernel.org To: live-patching@vger.kernel.org Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-11-07x86/efi: Add efi_fake_mem support for EFI_MEMORY_SPDan Williams
Given that EFI_MEMORY_SP is platform BIOS policy decision for marking memory ranges as "reserved for a specific purpose" there will inevitably be scenarios where the BIOS omits the attribute in situations where it is desired. Unlike other attributes if the OS wants to reserve this memory from the kernel the reservation needs to happen early in init. So early, in fact, that it needs to happen before e820__memblock_setup() which is a pre-requisite for efi_fake_memmap() that wants to allocate memory for the updated table. Introduce an x86 specific efi_fake_memmap_early() that can search for attempts to set EFI_MEMORY_SP via efi_fake_mem and update the e820 table accordingly. The KASLR code that scans the command line looking for user-directed memory reservations also needs to be updated to consider "efi_fake_mem=nn@ss:0x40000" requests. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-07x86/efi: EFI soft reservation to E820 enumerationDan Williams
UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the interpretation of the EFI Memory Types as "reserved for a specific purpose". The proposed Linux behavior for specific purpose memory is that it is reserved for direct-access (device-dax) by default and not available for any kernel usage, not even as an OOM fallback. Later, through udev scripts or another init mechanism, these device-dax claimed ranges can be reconfigured and hot-added to the available System-RAM with a unique node identifier. This device-dax management scheme implements "soft" in the "soft reserved" designation by allowing some or all of the reservation to be recovered as typical memory. This policy can be disabled at compile-time with CONFIG_EFI_SOFT_RESERVE=n, or runtime with efi=nosoftreserve. This patch introduces 2 new concepts at once given the entanglement between early boot enumeration relative to memory that can optionally be reserved from the kernel page allocator by default. The new concepts are: - E820_TYPE_SOFT_RESERVED: Upon detecting the EFI_MEMORY_SP attribute on EFI_CONVENTIONAL memory, update the E820 map with this new type. Only perform this classification if the CONFIG_EFI_SOFT_RESERVE=y policy is enabled, otherwise treat it as typical ram. - IORES_DESC_SOFT_RESERVED: Add a new I/O resource descriptor for a device driver to search iomem resources for application specific memory. Teach the iomem code to identify such ranges as "Soft Reserved". Note that the comment for do_add_efi_memmap() needed refreshing since it seemed to imply that the efi map might overflow the e820 table, but that is not an issue as of commit 7b6e4ba3cb1f "x86/boot/e820: Clean up the E820_X_MAX definition" that removed the 128 entry limit for e820__range_add(). A follow-on change integrates parsing of the ACPI HMAT to identify the node and sub-range boundaries of EFI_MEMORY_SP designated memory. For now, just identify and reserve memory of this type. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reported-by: kbuild test robot <lkp@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>