Age | Commit message (Collapse) | Author |
|
Hyper-V has historically initialized stimer-based clockevents late in the
process of onlining a CPU because clockevents depend on stimer
interrupts. In the original Hyper-V design, stimer interrupts generate a
VMbus message, so the VMbus machinery must be running first, and VMbus
can't be initialized until relatively late. On x86/64, LAPIC timer based
clockevents are used during early initialization before VMbus and
stimer-based clockevents are ready, and again during CPU offlining after
the stimer clockevents have been shut down.
Unfortunately, this design creates problems when offlining CPUs for
hibernation or other purposes. stimer-based clockevents are shut down
relatively early in the offlining process, so clockevents_unbind_device()
must be used to fallback to the LAPIC-based clockevents for the remainder
of the offlining process. Furthermore, the late initialization and early
shutdown of stimer-based clockevents doesn't work well on ARM64 since there
is no other timer like the LAPIC to fallback to. So CPU onlining and
offlining doesn't work properly.
Fix this by recognizing that stimer Direct Mode is the normal path for
newer versions of Hyper-V on x86/64, and the only path on other
architectures. With stimer Direct Mode, stimer interrupts don't require any
VMbus machinery. stimer clockevents can be initialized and shut down
consistent with how it is done for other clockevent devices. While the old
VMbus-based stimer interrupts must still be supported for backward
compatibility on x86, that mode of operation can be treated as legacy.
So add a new Hyper-V stimer entry in the CPU hotplug state list, and use
that new state when in Direct Mode. Update the Hyper-V clocksource driver
to allocate and initialize stimer clockevents earlier during boot. Update
Hyper-V initialization and the VMbus driver to use this new design. As a
result, the LAPIC timer is no longer used during boot or CPU
onlining/offlining and clockevents_unbind_device() is not called. But
retain the old design as a legacy implementation for older versions of
Hyper-V that don't support Direct Mode.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lkml.kernel.org/r/1573607467-9456-1-git-send-email-mikelley@microsoft.com
|
|
Pick up upstream fixes to avoid conflicts.
|
|
... or let function calls stick out and thus remain on a single line,
even if the 80 cols rule is violated by a couple of chars, for better
readability.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/20191114172200.19563-1-bp@alien8.de
|
|
When the crashkernel kernel command line option is specified, the low
1M memory will always be reserved now. Therefore, it's not necessary to
create a backup region anymore and also no need to copy the contents of
the first 640k to it.
Remove all the code related to handling that backup region.
[ bp: Massage commit message. ]
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: Dave Young <dyoung@redhat.com>
Cc: d.hatayama@fujitsu.com
Cc: dhowells@redhat.com
Cc: ebiederm@xmission.com
Cc: horms@verge.net.au
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: kexec@lists.infradead.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: vgoyal@redhat.com
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191108090027.11082-3-lijiang@redhat.com
|
|
Acquire the per-VM slots_lock when zapping all shadow pages as part of
toggling nx_huge_pages. The fast zap algorithm relies on exclusivity
(via slots_lock) to identify obsolete vs. valid shadow pages, because it
uses a single bit for its generation number. Holding slots_lock also
obviates the need to acquire a read lock on the VM's srcu.
Failing to take slots_lock when toggling nx_huge_pages allows multiple
instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
(kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
to enforce exclusivity).
Concurrent fast zap instances causes obsolete shadow pages to be
incorrectly identified as valid due to the single bit generation number
wrapping, which results in stale shadow pages being left in KVM's MMU
and leads to all sorts of undesirable behavior.
The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
toggling nx_huge_pages via its module param.
Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
an ulong-sized generation instead of relying on exclusivity for
correctness, but all callers except the recently added set_nx_huge_pages()
needed to hold slots_lock anyways. Therefore, this patch does not have
to be backported to stable kernels.
Given that toggling nx_huge_pages is by no means a fast path, force it
to conform to the current approach instead of reintroducing the previous
generation count.
Fixes: b8e8c8303ff28 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On x86, purgatory() copies the first 640K of memory to a backup region
because the kernel needs those first 640K for the real mode trampoline
during boot, among others.
However, when SME is enabled, the kernel cannot properly copy the old
memory to the backup area but reads only its encrypted contents. The
result is that the crash tool gets invalid pointers when parsing vmcore:
crash> kmem -s|grep -i invalid
kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
crash>
So reserve the remaining low 1M memory when the crashkernel option is
specified (after reserving real mode memory) so that allocated memory
does not fall into the low 1M area and thus the copying of the contents
of the first 640k to a backup region in purgatory() can be avoided
altogether.
This way, it does not need to be included in crash dumps or used for
anything except the trampolines that must live in the low 1M.
[ bp: Heavily rewrite commit message, flip check logic in
crash_reserve_low_1M().]
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: Dave Young <dyoung@redhat.com>
Cc: d.hatayama@fujitsu.com
Cc: dhowells@redhat.com
Cc: ebiederm@xmission.com
Cc: horms@verge.net.au
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: kexec@lists.infradead.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: vgoyal@redhat.com
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191108090027.11082-2-lijiang@redhat.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204793
|
|
Add a forward declaration of struct kimage to the crash.h header because
future changes will invoke a crash-specific function from the realmode
init path and the compiler will complain otherwise like this:
In file included from arch/x86/realmode/init.c:11:
./arch/x86/include/asm/crash.h:5:32: warning: ‘struct kimage’ declared inside\
parameter list will not be visible outside of this definition or declaration
5 | int crash_load_segments(struct kimage *image);
| ^~~~~~
./arch/x86/include/asm/crash.h:6:37: warning: ‘struct kimage’ declared inside\
parameter list will not be visible outside of this definition or declaration
6 | int crash_copy_backup_region(struct kimage *image);
| ^~~~~~
./arch/x86/include/asm/crash.h:7:39: warning: ‘struct kimage’ declared inside\
parameter list will not be visible outside of this definition or declaration
7 | int crash_setup_memmap_entries(struct kimage *image,
|
[ bp: Rewrite the commit message. ]
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: d.hatayama@fujitsu.com
Cc: dhowells@redhat.com
Cc: dyoung@redhat.com
Cc: ebiederm@xmission.com
Cc: horms@verge.net.au
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: kexec@lists.infradead.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: vgoyal@redhat.com
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191108090027.11082-4-lijiang@redhat.com
Link: https://lkml.kernel.org/r/201910310233.EJRtTMWP%25lkp@intel.com
|
|
This is to augment commit 3f5a7896a5 ("x86/mce: Include the PPIN in MCE
records when available").
I'm also adding "synd" and "ipid" fields to struct xen_mce, in an
attempt to keep field offsets in sync with struct mce. These two fields
won't get populated for now, though.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
When applying commit 7a5ee6edb42e ("KVM: X86: Fix initialization of MSR
lists"), it forgot to reset the three MSR lists number varialbes to 0
while removing the useless conditionals.
Fixes: 7a5ee6edb42e (KVM: X86: Fix initialization of MSR lists)
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If a huge page is recovered (and becomes no executable) while another
thread is executing it, the resulting contention on mmu_lock can cause
latency spikes. Disabling recovery for PREEMPT_RT kernels fixes this
issue.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Objtool complains about the new ftrace direct trampoline code:
arch/x86/kernel/ftrace_64.o: warning: objtool: ftrace_regs_caller()+0x190: stack state mismatch: cfa1=7+16 cfa2=7+24
Typically, code has a deterministic stack layout, such that at a given
instruction address, the stack frame size is always the same.
That's not the case for the new ftrace_regs_caller() code after it
adjusts the stack for the direct case. Just plead ignorance and assume
it's always the non-direct path. Note this creates a tiny window for
ORC to get confused.
Link: http://lkml.kernel.org/r/20191108225100.ea3bhsbdf6oerj6g@treble
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
As testing for direct calls from the function graph tracer adds a little
overhead (which is a lot when tracing every function), add a counter that
can be used to test if function_graph tracer needs to test for a direct
caller or not.
It would have been nicer if we could use a static branch, but the static
branch logic fails when used within the function graph tracer trampoline.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Enable x86 to allow for register_ftrace_direct(), where a custom trampoline
may be called directly from an ftrace mcount/fentry location.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
rdtgroup_cpus_write() and mkdir_rdt_prepare() call
rdtgroup_kn_lock_live() -> kernfs_to_rdtgroup() to get 'rdtgrp', and
then call the rdt_last_cmd_{clear,puts,...}() functions which will check
if rdtgroup_mutex is held/requires its caller to hold rdtgroup_mutex.
But if 'rdtgrp' returned from kernfs_to_rdtgroup() is NULL,
rdtgroup_mutex is not held and calling rdt_last_cmd_{clear,puts,...}()
will result in a self-incurred, potential lockdep warning.
Remove the rdt_last_cmd_{clear,puts,...}() calls in these two paths.
Just returning error should be sufficient to report to the user that the
entry doesn't exist any more.
[ bp: Massage. ]
Fixes: 94457b36e8a5 ("x86/intel_rdt: Add diagnostics when writing the cpus file")
Fixes: cfd0f34e4cd5 ("x86/intel_rdt: Add diagnostics when making directories")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: pei.p.jia@intel.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1573079796-11713-1-git-send-email-xiaochen.shen@intel.com
|
|
With recent optimizations to AUX and PT buffer management code (high order
AUX allocations, opportunistic Single Range Output), it is far more likely
now that the output MSRs won't need reprogramming on every sched-in.
To avoid needless WRMSRs of those registers, cache their values and only
write them when needed.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191105082701.78442-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Most of PT implementations support Single Range Output mode, which is
an alternative to ToPA that can be used for a single contiguous buffer
and if we don't require an interrupt, that is, in AUX snapshot mode.
Now that perf core will use high order allocations for the AUX buffer,
in many cases the first condition will also be satisfied.
The two most obvious benefits of the Single Range Output mode over the
ToPA are:
* not having to allocate the ToPA table(s),
* not using the ToPA walk hardware.
Make use of this functionality where available and appropriate.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191105082701.78442-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Add AUX sampling support to the PT PMU: implement an NMI-safe callback
that takes a snapshot of the buffer without touching the event states.
This is done for PT events that don't use PMIs, that is, snapshot mode
(RO mapping of the AUX area).
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
PT trace is now enabled at the bottom of the event configuration
function that takes care of all configuration bits related to a given
event, including the address filter update. This is only needed where
the event configuration changes, that is, in ->add()/->start().
In the interrupt path we can use a lighter version that keeps the
configuration intact, since it hasn't changed, and only flips the
enable bit.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts,
DAX/ZONE_DEVICE, and module unload/reload"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
KVM: VMX: Introduce pi_is_pir_empty() helper
KVM: VMX: Do not change PID.NDST when loading a blocked vCPU
KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts
KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON
KVM: X86: Fix initialization of MSR lists
KVM: fix placement of refcount initialization
KVM: Fix NULL-ptr deref after kvm_create_vm fails
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TSX Async Abort and iTLB Multihit mitigations from Thomas Gleixner:
"The performance deterioration departement is not proud at all of
presenting the seventh installment of speculation mitigations and
hardware misfeature workarounds:
1) TSX Async Abort (TAA) - 'The Annoying Affair'
TAA is a hardware vulnerability that allows unprivileged
speculative access to data which is available in various CPU
internal buffers by using asynchronous aborts within an Intel TSX
transactional region.
The mitigation depends on a microcode update providing a new MSR
which allows to disable TSX in the CPU. CPUs which have no
microcode update can be mitigated by disabling TSX in the BIOS if
the BIOS provides a tunable.
Newer CPUs will have a bit set which indicates that the CPU is not
vulnerable, but the MSR to disable TSX will be available
nevertheless as it is an architected MSR. That means the kernel
provides the ability to disable TSX on the kernel command line,
which is useful as TSX is a truly useful mechanism to accelerate
side channel attacks of all sorts.
2) iITLB Multihit (NX) - 'No eXcuses'
iTLB Multihit is an erratum where some Intel processors may incur
a machine check error, possibly resulting in an unrecoverable CPU
lockup, when an instruction fetch hits multiple entries in the
instruction TLB. This can occur when the page size is changed
along with either the physical address or cache type. A malicious
guest running on a virtualized system can exploit this erratum to
perform a denial of service attack.
The workaround is that KVM marks huge pages in the extended page
tables as not executable (NX). If the guest attempts to execute in
such a page, the page is broken down into 4k pages which are
marked executable. The workaround comes with a mechanism to
recover these shattered huge pages over time.
Both issues come with full documentation in the hardware
vulnerabilities section of the Linux kernel user's and administrator's
guide.
Thanks to all patch authors and reviewers who had the extraordinary
priviledge to be exposed to this nuisance.
Special thanks to Borislav Petkov for polishing the final TAA patch
set and to Paolo Bonzini for shepherding the KVM iTLB workarounds and
providing also the backports to stable kernels for those!"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
Documentation: Add ITLB_MULTIHIT documentation
kvm: x86: mmu: Recovery of shattered NX large pages
kvm: Add helper function for creating VM worker threads
kvm: mmu: ITLB_MULTIHIT mitigation
cpu/speculation: Uninline and export CPU mitigations helpers
x86/cpu: Add Tremont to the cpu vulnerability whitelist
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
x86/tsx: Add config options to set tsx=on|off|auto
x86/speculation/taa: Add documentation for TSX Async Abort
x86/tsx: Add "auto" option to the tsx= cmdline parameter
kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
x86/speculation/taa: Add sysfs reporting for TSX Async Abort
x86/speculation/taa: Add mitigation for TSX Async Abort
x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
x86/cpu: Add a helper function x86_read_arch_cap_msr()
x86/msr: Add the IA32_TSX_CTRL MSR
|
|
'arm/mediatek', 'arm/tegra', 'arm/smmu', 'x86/amd', 'x86/vt-d', 'virtio' and 'core' into next
|
|
The setup_data is a bit awkward to use for extremely large data objects,
both because the setup_data header has to be adjacent to the data object
and because it has a 32-bit length field. However, it is important that
intermediate stages of the boot process have a way to identify which
chunks of memory are occupied by kernel data. Thus introduce an uniform
way to specify such indirect data as setup_indirect struct and
SETUP_INDIRECT type.
And finally bump setup_header version in arch/x86/boot/header.S.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ross Philipson <ross.philipson@oracle.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: ard.biesheuvel@linaro.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: dave.hansen@linux.intel.com
Cc: eric.snowberg@oracle.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: kanth.ghatraju@oracle.com
Cc: linux-doc@vger.kernel.org
Cc: linux-efi <linux-efi@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rdunlap@infradead.org
Cc: ross.philipson@oracle.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/20191112134640.16035-4-daniel.kiper@oracle.com
|
|
This field contains maximal allowed type for setup_data.
Do not bump setup_header version in arch/x86/boot/header.S because it
will be followed by additional changes coming into the Linux/x86 boot
protocol.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Philipson <ross.philipson@oracle.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: ard.biesheuvel@linaro.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: dave.hansen@linux.intel.com
Cc: eric.snowberg@oracle.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: kanth.ghatraju@oracle.com
Cc: linux-doc@vger.kernel.org
Cc: linux-efi <linux-efi@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rdunlap@infradead.org
Cc: ross.philipson@oracle.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/20191112134640.16035-3-daniel.kiper@oracle.com
|
|
The relationships between the headers are analogous to the various data
sections:
setup_header = .data
boot_params/setup_data = .bss
What is missing from the above list? That's right:
kernel_info = .rodata
We have been (ab)using .data for things that could go into .rodata or .bss for
a long time, for lack of alternatives and -- especially early on -- inertia.
Also, the BIOS stub is responsible for creating boot_params, so it isn't
available to a BIOS-based loader (setup_data is, though).
setup_header is permanently limited to 144 bytes due to the reach of the
2-byte jump field, which doubles as a length field for the structure, combined
with the size of the "hole" in struct boot_params that a protected-mode loader
or the BIOS stub has to copy it into. It is currently 119 bytes long, which
leaves us with 25 very precious bytes. This isn't something that can be fixed
without revising the boot protocol entirely, breaking backwards compatibility.
boot_params proper is limited to 4096 bytes, but can be arbitrarily extended
by adding setup_data entries. It cannot be used to communicate properties of
the kernel image, because it is .bss and has no image-provided content.
kernel_info solves this by providing an extensible place for information about
the kernel image. It is readonly, because the kernel cannot rely on a
bootloader copying its contents anywhere, but that is OK; if it becomes
necessary it can still contain data items that an enabled bootloader would be
expected to copy into a setup_data chunk.
Do not bump setup_header version in arch/x86/boot/header.S because it
will be followed by additional changes coming into the Linux/x86 boot
protocol.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Philipson <ross.philipson@oracle.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: ard.biesheuvel@linaro.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: dave.hansen@linux.intel.com
Cc: eric.snowberg@oracle.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: kanth.ghatraju@oracle.com
Cc: linux-doc@vger.kernel.org
Cc: linux-efi <linux-efi@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rdunlap@infradead.org
Cc: ross.philipson@oracle.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/20191112134640.16035-2-daniel.kiper@oracle.com
|
|
Some modern systems have very tight thermal tolerances. Because of this
they may cross thermal thresholds when running normal workloads (even
during boot). The CPU hardware will react by limiting power/frequency
and using duty cycles to bring the temperature back into normal range.
Thus users may see a "critical" message about the "temperature above
threshold" which is soon followed by "temperature/speed normal". These
messages are rate-limited, but still may repeat every few minutes.
This issue became worse starting with the Ivy Bridge generation of
CPUs because they include a TCC activation offset in the MSR
IA32_TEMPERATURE_TARGET. OEMs use this to provide alerts long before
critical temperatures are reached.
A test run on a laptop with Intel 8th Gen i5 core for two hours with a
workload resulted in 20K+ thermal interrupts per CPU for core level and
another 20K+ interrupts at package level. The kernel logs were full of
throttling messages.
The real value of these threshold interrupts, is to debug problems with
the external cooling solutions and performance issues due to excessive
throttling.
So the solution here is the following:
- In the current thermal_throttle folder, show:
- the maximum time for one throttling event and,
- the total amount of time the system was in throttling state.
- Do not log short excursions.
- Log only when, in spite of thermal throttling, the temperature is rising.
On the high threshold interrupt trigger a delayed workqueue that
monitors the threshold violation log bit (THERM_STATUS_PROCHOT_LOG). When
the log bit is set, this workqueue callback calculates three point moving
average and logs a warning message when the temperature trend is rising.
When this log bit is clear and temperature is below threshold
temperature, then the workqueue callback logs a "Normal" message. Once a
high threshold event is logged, the logging is rate-limited.
With this patch on the same test laptop, no warnings are printed in the logs
as the max time the processor could bring the temperature under control is
only 280 ms.
This implementation is done with the inputs from Alan Cox and Tony Luck.
[ bp: Touchups. ]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: bberg@redhat.com
Cc: ckellner@redhat.com
Cc: hdegoede@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191111214312.81365-1-srinivas.pandruvada@linux.intel.com
|
|
Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
PC10, which in consequence marks TSC as unstable because HPET is used as
watchdog clocksource for TSC.
Harry Pan tried to work around it in the clocksource watchdog code [1]
thereby creating a circular dependency between HPET and TSC. This also
ignores the fact, that HPET is not only unsuitable as watchdog clocksource
on these systems, it becomes unusable in general.
Disable HPET on affected platforms.
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203183
Link: https://lore.kernel.org/lkml/20190516090651.1396-1-harry.pan@intel.com/ [1]
Link: https://lkml.kernel.org/r/20191016103816.30650-1-kai.heng.feng@canonical.com
|
|
Systems which do not support RTC run into boot problems as the kernel
assumes the availability of the RTC by default.
On device tree configured systems the availability of the RTC can be
detected by querying the corresponding device tree node.
Implement a wallclock init function to query the device tree and disable
RTC if the RTC is marked as not available in the corresponding node.
[ tglx: Rewrote changelog and comments. Added proper __init(const)
annotations. ]
Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Rahul Tanwar <rahul.tanwar@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/b84d9152ce0c1c09896ff4987e691a0715cb02df.1570693058.git.rahul.tanwar@linux.intel.com
|
|
If the hardware supports TSC scaling, Hyper-V will set bit 15 of the
HV_PARTITION_PRIVILEGE_MASK in guest VMs with a compatible Hyper-V
configuration version. Bit 15 corresponds to the
AccessTscInvariantControls privilege. If this privilege bit is set,
guests can access the HvSyntheticInvariantTscControl MSR: guests can
set bit 0 of this synthetic MSR to enable the InvariantTSC feature.
After setting the synthetic MSR, CPUID will enumerate support for
InvariantTSC.
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lkml.kernel.org/r/20191003155200.22022-1-parri.andrea@gmail.com
|
|
When sending an IPI to a single CPU there is no need to deal with cpumasks.
With 2 CPU guest on WS2019 a minor (like 3%, 8043 -> 7761 CPU cycles)
improvement with smp_call_function_single() loop benchmark can be seeb. The
optimization, however, is tiny and straitforward. Also, send_ipi_one() is
important for PV spinlock kick.
Switching to the regular APIC IPI send for CPU > 64 case does not make
sense as it is twice as expesive (12650 CPU cycles for __send_ipi_mask_ex()
call, 26000 for orig_apic.send_IPI(cpu, vector)).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Link: https://lkml.kernel.org/r/20191027151938.7296-1-vkuznets@redhat.com
|
|
Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
instead manually handle ZONE_DEVICE on a case-by-case basis. For things
like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
pages, e.g. put pages grabbed via gup(). But for flows such as setting
A/D bits or shifting refcounts for transparent huge pages, KVM needs to
to avoid processing ZONE_DEVICE pages as the flows in question lack the
underlying machinery for proper handling of ZONE_DEVICE pages.
This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
when running a KVM guest backed with /dev/dax memory, as KVM straight up
doesn't put any references to ZONE_DEVICE pages acquired by gup().
Note, Dan Williams proposed an alternative solution of doing put_page()
on ZONE_DEVICE pages immediately after gup() in order to simplify the
auditing needed to ensure is_zone_device_page() is called if and only if
the backing device is pinned (via gup()). But that approach would break
kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
unmap() when accessing guest memory, unlike KVM's secondary MMU, which
coordinates with mmu_notifier invalidations to avoid creating stale
page references, i.e. doesn't rely on pages being pinned.
[*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
Reported-by: Adam Borowski <kilobyte@angband.pl>
Analyzed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable@vger.kernel.org
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Streamline the PID.PIR check and change its call sites to use
the newly added helper.
Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When vCPU enters block phase, pi_pre_block() inserts vCPU to a per pCPU
linked list of all vCPUs that are blocked on this pCPU. Afterwards, it
changes PID.NV to POSTED_INTR_WAKEUP_VECTOR which its handler
(wakeup_handler()) is responsible to kick (unblock) any vCPU on that
linked list that now has pending posted interrupts.
While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which
will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be
scheduled to run on a different pCPU, vmx_vcpu_pi_load() will clear
PID.SN but will also *overwrite PID.NDST to this different pCPU*.
Instead of keeping it with original pCPU which vCPU had entered block
phase on.
This results in an issue because when a posted interrupt is delivered, as
the wakeup_handler() will be executed and fail to find blocked vCPU on
its per pCPU linked list of all vCPUs that are blocked on this pCPU.
Which is due to the vCPU being placed on a *different* per pCPU
linked list i.e. the original pCPU in which it entered block phase.
The regression is introduced by commit c112b5f50232 ("KVM: x86:
Recompute PID.ON when clearing PID.SN"). Therefore, partially revert
it and reintroduce the condition in vmx_vcpu_pi_load() responsible for
avoiding changing PID.NDST when loading a blocked vCPU.
Fixes: c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
Tested-by: Nathan Ni <nathan.ni@oracle.com>
Co-developed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Commit 17e433b54393 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
introduced vmx_dy_apicv_has_pending_interrupt() in order to determine
if a vCPU have a pending posted interrupt. This routine is used by
kvm_vcpu_on_spin() when searching for a a new runnable vCPU to schedule
on pCPU instead of a vCPU doing busy loop.
vmx_dy_apicv_has_pending_interrupt() determines if a
vCPU has a pending posted interrupt solely based on PID.ON. However,
when a vCPU is preempted, vmx_vcpu_pi_put() sets PID.SN which cause
raised posted interrupts to only set bit in PID.PIR without setting
PID.ON (and without sending notification vector), as depicted in VT-d
manual section 5.2.3 "Interrupt-Posting Hardware Operation".
Therefore, checking PID.ON is insufficient to determine if a vCPU has
pending posted interrupts and instead we should also check if there is
some bit set on PID.PIR if PID.SN=1.
Fixes: 17e433b54393 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
Co-developed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The Outstanding Notification (ON) bit is part of the Posted Interrupt
Descriptor (PID) as opposed to the Posted Interrupts Register (PIR).
The latter is a bitmap for pending vectors.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The three MSR lists(msrs_to_save[], emulated_msrs[] and
msr_based_features[]) are global arrays of kvm.ko, which are
adjusted (copy supported MSRs forward to override the unsupported MSRs)
when insmod kvm-{intel,amd}.ko, but it doesn't reset these three arrays
to their initial value when rmmod kvm-{intel,amd}.ko. Thus, at the next
installation, kvm-{intel,amd}.ko will do operations on the modified
arrays with some MSRs lost and some MSRs duplicated.
So define three constant arrays to hold the initial MSR lists and
initialize msrs_to_save[], emulated_msrs[] and msr_based_features[]
based on the constant arrays.
Cc: stable@vger.kernel.org
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
[Remove now useless conditionals. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Various architectures that use asm-generic/io.h still defined their
own default versions of ioremap_nocache, ioremap_wt and ioremap_wc
that point back to plain ioremap directly or indirectly. Remove these
definitions and rely on asm-generic/io.h instead. For this to work
the backup ioremap_* defintions needs to be changed to purely cpp
macros instea of inlines to cover for architectures like openrisc
that only define ioremap after including <asm-generic/io.h>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Palmer Dabbelt <palmer@dabbelt.com>
|
|
Use ioremap() as the main implemented function, and defines
ioremap_nocache() as a deprecated alias of ioremap() in
preparation of removing ioremap_nocache() entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
|
|
VT-d RMRR (Reserved Memory Region Reporting) regions are reserved
for device use only and should not be part of allocable memory pool of OS.
BIOS e820_table reports complete memory map to OS, including OS usable
memory ranges and BIOS reserved memory ranges etc.
x86 BIOS may not be trusted to include RMRR regions as reserved type
of memory in its e820 memory map, hence validate every RMRR entry
with the e820 memory map to make sure the RMRR regions will not be
used by OS for any other purposes.
ia64 EFI is working fine so implement RMRR validation as a dummy function
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Yian Chen <yian.chen@intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The odd out jump label is really not needed. Get rid of
it by return true directly while r < 0 as suggested by
Paolo. This further lead to var changed being unused.
Remove it too.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The devices found behind this PCIe chip have unusual DMA mapping
constraints as there is an AMBA interconnect placed in between them and
the different PCI endpoints. The offset between physical memory
addresses and AMBA's view is provided by reading a PCI config register,
which is saved and used whenever DMA mapping is needed.
It turns out that this DMA setup can be represented by properly setting
'dma_pfn_offset', 'dma_bus_mask' and 'dma_mask' during the PCI device
enable fixup. And ultimately allows us to get rid of this device's
custom DMA functions.
Aside from the code deletion and DMA setup, sta2x11_pdev_to_mapping() is
moved to avoid warnings whenever CONFIG_PM is not enabled.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
IO regions are now requested and released by this device's parent.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
IO regions are now requested and released by this device's parent.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
'-Wunused-but-set-variable' triggers this warning:
arch/x86/events/amd/core.c: In function amd_pmu_handle_irq:
arch/x86/events/amd/core.c:656:6: warning: variable active set but not used [-Wunused-but-set-variable]
GCC is right, 'active' is not used anymore.
This variable was introduced earlier this year and then removed in:
df4d29732fdad perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp
[ mingo: Improved the changelog, fixed build warning caused by this fix, improved surrounding code. ]
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Cc: <acme@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Cc: <mark.rutland@arm.com>
Cc: <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191110094453.113001-1-zhengyongjun3@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A small set of fixes for x86:
- Make the tsc=reliable/nowatchdog command line parameter work again.
It was broken with the introduction of the early TSC clocksource.
- Prevent the evaluation of exception stacks before they are set up.
This causes a crash in dumpstack because the stack walk termination
gets screwed up.
- Prevent a NULL pointer dereference in the rescource control file
system.
- Avoid bogus warnings about APIC id mismatch related to the LDR
which can happen when the LDR is not in use and therefore not
initialized. Only evaluate that when the APIC is in logical
destination mode"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Respect tsc command line paraemeter for clocksource_tsc_early
x86/dumpstack/64: Don't evaluate exception stacks before setup
x86/apic/32: Avoid bogus LDR warnings
x86/resctrl: Prevent NULL pointer dereference when reading mondata
|
|
For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of
cpu_bugs_smt_update() causes the function to return early, unintentionally
skipping the MDS and TAA logic.
This is not a problem for MDS, because there appears to be no overlap
between IBRS_ALL and MDS-affected CPUs. So the MDS mitigation would be
disabled and nothing would need to be done in this function anyway.
But for TAA, the TAA_MSG_SMT string will never get printed on Cascade
Lake and newer.
The check is superfluous anyway: when 'spectre_v2_enabled' is
SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always
SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement
handles it appropriately by doing nothing. So just remove the check.
Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
|
|
commit 6415b38bae26 ("x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE
for the ORC unwinder") added the ORC unwinder as a "reliable" unwinder.
Update the help text to reflect that change: the frame pointer unwinder
is no longer the only one that can provide HAVE_RELIABLE_STACKTRACE.
Link: http://lkml.kernel.org/r/20191107032958.14034-1-joe.lawrence@redhat.com
To: linux-kernel@vger.kernel.org
To: live-patching@vger.kernel.org
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Given that EFI_MEMORY_SP is platform BIOS policy decision for marking
memory ranges as "reserved for a specific purpose" there will inevitably
be scenarios where the BIOS omits the attribute in situations where it
is desired. Unlike other attributes if the OS wants to reserve this
memory from the kernel the reservation needs to happen early in init. So
early, in fact, that it needs to happen before e820__memblock_setup()
which is a pre-requisite for efi_fake_memmap() that wants to allocate
memory for the updated table.
Introduce an x86 specific efi_fake_memmap_early() that can search for
attempts to set EFI_MEMORY_SP via efi_fake_mem and update the e820 table
accordingly.
The KASLR code that scans the command line looking for user-directed
memory reservations also needs to be updated to consider
"efi_fake_mem=nn@ss:0x40000" requests.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
interpretation of the EFI Memory Types as "reserved for a specific
purpose".
The proposed Linux behavior for specific purpose memory is that it is
reserved for direct-access (device-dax) by default and not available for
any kernel usage, not even as an OOM fallback. Later, through udev
scripts or another init mechanism, these device-dax claimed ranges can
be reconfigured and hot-added to the available System-RAM with a unique
node identifier. This device-dax management scheme implements "soft" in
the "soft reserved" designation by allowing some or all of the
reservation to be recovered as typical memory. This policy can be
disabled at compile-time with CONFIG_EFI_SOFT_RESERVE=n, or runtime with
efi=nosoftreserve.
This patch introduces 2 new concepts at once given the entanglement
between early boot enumeration relative to memory that can optionally be
reserved from the kernel page allocator by default. The new concepts
are:
- E820_TYPE_SOFT_RESERVED: Upon detecting the EFI_MEMORY_SP
attribute on EFI_CONVENTIONAL memory, update the E820 map with this
new type. Only perform this classification if the
CONFIG_EFI_SOFT_RESERVE=y policy is enabled, otherwise treat it as
typical ram.
- IORES_DESC_SOFT_RESERVED: Add a new I/O resource descriptor for
a device driver to search iomem resources for application specific
memory. Teach the iomem code to identify such ranges as "Soft Reserved".
Note that the comment for do_add_efi_memmap() needed refreshing since it
seemed to imply that the efi map might overflow the e820 table, but that
is not an issue as of commit 7b6e4ba3cb1f "x86/boot/e820: Clean up the
E820_X_MAX definition" that removed the 128 entry limit for
e820__range_add().
A follow-on change integrates parsing of the ACPI HMAT to identify the
node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
now, just identify and reserve memory of this type.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|