summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2018-06-21locking/refcounts: Include fewer headers in <linux/refcount.h>Alexey Dobriyan
Debloat <linux/refcount.h>'s dependencies: - <linux/kernel.h> is not needed, but <linux/compiler.h> is. - <linux/mutex.h> is not needed, only a forward declaration of "struct mutex". - <linux/spinlock.h> is not needed, <linux/spinlock_types.h> is enough. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/lkml/20180331220036.GA7676@avx2 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21kprobes/x86: Fix %p uses in error messagesMasami Hiramatsu
Remove all %p uses in error messages in kprobes/x86. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: David S . Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Tobin C . Harding <me@tobin.cc> Cc: Will Deacon <will.deacon@arm.com> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491902310.9916.13355297638917767319.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpufeatures: Add detection of L1D cache flush support.Konrad Rzeszutek Wilk
336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR (IA32_FLUSH_CMD) which is detected by CPUID.7.EDX[28]=1 bit being set. This new MSR "gives software a way to invalidate structures with finer granularity than other architectual methods like WBINVD." A copy of this document is available at https://bugzilla.kernel.org/show_bug.cgi?id=199511 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-06-21uprobes/x86: Remove incorrect WARN_ON() in uprobe_init_insn()Oleg Nesterov
insn_get_length() has the side-effect of processing the entire instruction but only if it was decoded successfully, otherwise insn_complete() can fail and in this case we need to just return an error without warning. Reported-by: syzbot+30d675e3ca03c1c351e7@syzkaller.appspotmail.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: syzkaller-bugs@googlegroups.com Link: https://lkml.kernel.org/lkml/20180518162739.GA5559@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinderJiri Slaby
In SUSE, we need a reliable stack unwinder for kernel live patching, but we do not want to enable frame pointers for performance reasons. So after the previous patches to make the ORC reliable, mark ORC as a reliable stack unwinder on x86. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/20180518064713.26440-6-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/unwind/orc: Detect the end of the stackJosh Poimboeuf
The existing UNWIND_HINT_EMPTY annotations happen to be good indicators of where entry code calls into C code for the first time. So also use them to mark the end of the stack for the ORC unwinder. Use that information to set unwind->error if the ORC unwinder doesn't unwind all the way to the end. This will be needed for enabling HAVE_RELIABLE_STACKTRACE for the ORC unwinder so we can use it with the livepatch consistency model. Thanks to Jiri Slaby for teaching the ORCs about the unwind hints. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/20180518064713.26440-5-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/stacktrace: Do not fail for ORC with regs on stackJiri Slaby
save_stack_trace_reliable now returns "non reliable" when there are kernel pt_regs on stack. This means an interrupt or exception happened somewhere down the route. It is a problem for the frame pointer unwinder, because the frame might not have been set up yet when the irq happened, so the unwinder might fail to unwind from the interrupted function. With ORC, this is not a problem, as ORC has out-of-band data. We can find ORC data even for the IP in the interrupted function and always unwind one level up reliably. So lift the check to apply only when CONFIG_FRAME_POINTER=y is enabled. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/20180518064713.26440-4-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/stacktrace: Clarify the reliable success pathsJiri Slaby
Make clear which path is for user tasks and for kthreads and idle tasks. This will allow easier plug-in of the ORC unwinder in the next patches. Note that we added a check for unwind error to the top of the loop, so that an error is returned also for user tasks (the 'goto success' would skip the check after the loop otherwise). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/20180518064713.26440-3-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/stacktrace: Remove STACKTRACE_DUMP_ONCEJiri Slaby
The stack unwinding can sometimes fail yet. Especially with the generated debug info. So do not yell at users -- live patching (the only user of this interface) will inform the user about the failure gracefully. And given this was the only user of the macro, remove the macro proper too. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/20180518064713.26440-2-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/stacktrace: Do not unwind after user regsJiri Slaby
Josh pointed out, that there is no way a frame can be after user regs. So remove the last unwind and the check. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/lkml/20180518064713.26440-1-jslaby@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/speculation/l1tf: Extend 64bit swap file size limitVlastimil Babka
The previous patch has limited swap file size so that large offsets cannot clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation. It assumed that offsets are encoded starting with bit 12, same as pfn. But on x86_64, offsets are encoded starting with bit 9. Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA and 256TB with 46bit MAX_PA. Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-06-21x86/platform/UV: Add kernel parameter to set memory block sizemike.travis@hpe.com
Add a kernel parameter that allows setting UV memory block size. This is to provide an adjustment for new forms of PMEM and other DIMM memory that might require alignment restrictions other than scanning the global address table for the required minimum alignment. The value set will be further adjusted by both the GAM range table scan as well as restrictions imposed by set_memory_block_size_order(). Signed-off-by: Mike Travis <mike.travis@hpe.com> Reviewed-by: Andrew Banman <andrew.banman@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dan.j.williams@intel.com Cc: jgross@suse.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/20180524201711.854849120@stormcage.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/platform/UV: Use new set memory block size functionmike.travis@hpe.com
Add a call to the new function to "adjust" the current fixed UV memory block size of 2GB so it can be changed to a different physical boundary. This accommodates changes in the Intel BIOS, and therefore UV BIOS, which now can align boundaries different than the previous UV standard of 2GB. It also flags any UV Global Address boundaries from BIOS that cause a change in the mem block size (boundary). The current boundary of 2GB has been used on UV since the first system release in 2009 with Linux 2.6 and has worked fine. But the new NVDIMM persistent memory modules (PMEM), along with the Intel BIOS changes to support these modules caused the memory block size boundary to be set to a lower limit. Intel only guarantees that this minimum boundary at 64MB though the current Linux limit is 128MB. Note that the default remains 2GB if no changes occur. Signed-off-by: Mike Travis <mike.travis@hpe.com> Reviewed-by: Andrew Banman <andrew.banman@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dan.j.williams@intel.com Cc: jgross@suse.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/20180524201711.732785782@stormcage.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/platform/UV: Add adjustable set memory block size functionmike.travis@hpe.com
Add a new function to "adjust" the current fixed UV memory block size of 2GB so it can be changed to a different physical boundary. This is out of necessity so arch dependent code can accommodate specific BIOS requirements which can align these new PMEM modules at less than the default boundaries. A "set order" type of function was used to insure that the memory block size will be a power of two value without requiring a validity check. 64GB was chosen as the upper limit for memory block size values to accommodate upcoming 4PB systems which have 6 more bits of physical address space (46 becoming 52). Signed-off-by: Mike Travis <mike.travis@hpe.com> Reviewed-by: Andrew Banman <andrew.banman@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dan.j.williams@intel.com Cc: jgross@suse.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/20180524201711.609546602@stormcage.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/spectre_v1: Disable compiler optimizations over array_index_mask_nospec()Dan Williams
Mark Rutland noticed that GCC optimization passes have the potential to elide necessary invocations of the array_index_mask_nospec() instruction sequence, so mark the asm() volatile. Mark explains: "The volatile will inhibit *some* cases where the compiler could lift the array_index_nospec() call out of a branch, e.g. where there are multiple invocations of array_index_nospec() with the same arguments: if (idx < foo) { idx1 = array_idx_nospec(idx, foo) do_something(idx1); } < some other code > if (idx < foo) { idx2 = array_idx_nospec(idx, foo); do_something_else(idx2); } ... since the compiler can determine that the two invocations yield the same result, and reuse the first result (likely the same register as idx was in originally) for the second branch, effectively re-writing the above as: if (idx < foo) { idx = array_idx_nospec(idx, foo); do_something(idx); } < some other code > if (idx < foo) { do_something_else(idx); } ... if we don't take the first branch, then speculatively take the second, we lose the nospec protection. There's more info on volatile asm in the GCC docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile " Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: babdde2698d4 ("x86: Implement array_index_mask_nospec") Link: https://lkml.kernel.org/lkml/152838798950.14521.4893346294059739135.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21perf/x86/intel/lbr: Optimize context switches for the LBR call stackKan Liang
Context switches with perf LBR call stack context are fairly expensive because they do a lot of MSR writes. Currently we unconditionally do the expensive operation when LBR call stack is enabled. It's not necessary for some common cases, e.g task -> other kernel thread -> same task. The LBR registers are not changed, hence they don't need to be rewritten/restored. Introduce per-CPU variables to track the last LBR call stack context. If the same context is scheduled in, the rewrite/restore is not required, with the following two exceptions: - The LBR registers may be modified by a normal LBR event, i.e., adding a new LBR event or scheduling an existing LBR event. In both cases, the LBR registers are reset first. The last LBR call stack information is cleared in intel_pmu_lbr_reset(). Restoring the LBR registers is required. - The LBR registers are initialized to zero in C6. If the LBR registers which TOS points is cleared, C6 must be entered while swapped out. Restoring the LBR registers is required as well. These exceptions are not common. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: acme@kernel.org Cc: eranian@google.com Link: https://lore.kernel.org/lkml/1528213126-4312-2-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21perf/x86/intel/lbr: Fix incomplete LBR call stackKan Liang
LBR has a limited stack size. If a task has a deeper call stack than LBR's stack size, only the overflowed part is reported. A complete call stack may not be reconstructed by perf tool. Current code doesn't access all LBR registers. It only read the ones below the TOS. The LBR registers above the TOS will be discarded unconditionally. When a CALL is captured, the TOS is incremented by 1 , modulo max LBR stack size. The LBR HW only records the call stack information to the register which the TOS points to. It will not touch other LBR registers. So the registers above the TOS probably still store the valid call stack information for an overflowed call stack, which need to be reported. To retrieve complete call stack information, we need to start from TOS, read all LBR registers until an invalid entry is detected. 0s can be used to detect the invalid entry, because: - When a RET is captured, the HW zeros the LBR register which TOS points to, then decreases the TOS. - The LBR registers are reset to 0 when adding a new LBR event or scheduling an existing LBR event. - A taken branch at IP 0 is not expected The context switch code is also modified to save/restore all valid LBR registers. Furthermore, the LBR registers, which don't have valid call stack information, need to be reset in restore, because they may be polluted while swapped out. Here is a small test program, tchain_deep. Its call stack is deeper than 32. noinline void f33(void) { int i; for (i = 0; i < 10000000;) { if (i%2) i++; else i++; } } noinline void f32(void) { f33(); } noinline void f31(void) { f32(); } ... ... noinline void f1(void) { f2(); } int main() { f1(); } Here is the test result on SKX. The max stack size of SKX is 32. Without the patch: $ perf record -e cycles --call-graph lbr -- ./tchain_deep $ perf report --stdio # # Children Self Command Shared Object Symbol # ........ ........ ........... ................ ................. # 100.00% 99.99% tchain_deep tchain_deep [.] f33 | --99.99%--f30 f31 f32 f33 With the patch: $ perf record -e cycles --call-graph lbr -- ./tchain_deep $ perf report --stdio # Children Self Command Shared Object Symbol # ........ ........ ........... ................ .................. # 99.99% 0.00% tchain_deep tchain_deep [.] f1 | ---f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20 f21 f22 f23 f24 f25 f26 f27 f28 f29 f30 f31 f32 f33 Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: eranian@google.com Link: https://lore.kernel.org/lkml/1528213126-4312-1-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize ↵Uros Bizjak
code generation Use CC_SET(z)/CC_OUT(z) instead of explicit SETZ instruction. Using these two defines, the compiler that supports generation of condition code outputs from inline assembly flags generates e.g.: cmpxchg8b %fs:(%esi) jne 172255 <__kmalloc+0x65> instead of: cmpxchg8b %fs:(%esi) sete %al test %al,%al je 172255 <__kmalloc+0x65> Note that older compilers now generate: cmpxchg8b %fs:(%esi) sete %cl test %cl,%cl je 173a85 <__kmalloc+0x65> since we have to mark that cmpxchg8b instruction outputs to %eax register and this way clobbers the value in the register. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180605163910.13015-1-ubizjak@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Make conditional inc/dec ops optionalMark Rutland
The conditional inc/dec ops differ for atomic_t and atomic64_t: - atomic_inc_unless_positive() is optional for atomic_t, and doesn't exist for atomic64_t. - atomic_dec_unless_negative() is optional for atomic_t, and doesn't exist for atomic64_t. - atomic_dec_if_positive is optional for atomic_t, and is mandatory for atomic64_t. Let's make these consistently optional for both. At the same time, let's clean up the existing fallbacks to use atomic_try_cmpxchg(). The instrumented atomics are updated accordingly. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-18-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Make unconditional inc/dec ops optionalMark Rutland
Many of the inc/dec ops are mandatory, but for most architectures inc/dec are simply trivial wrappers around their corresponding add/sub ops. Let's make all the inc/dec ops optional, so that we can get rid of these boilerplate wrappers. The instrumented atomics are updated accordingly. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-17-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Make test ops optionalMark Rutland
Some of the atomics return the result of a test applied after the atomic operation, and almost all architectures implement these as trivial wrappers around the underlying atomic. Specifically: * <atomic>_inc_and_test(v) is (<atomic>_inc_return(v) == 0) * <atomic>_dec_and_test(v) is (<atomic>_dec_return(v) == 0) * <atomic>_sub_and_test(i, v) is (<atomic>_sub_return(i, v) == 0) * <atomic>_add_negative(i, v) is (<atomic>_add_return(i, v) < 0) Rather than have these definitions duplicated in all architectures, with minor inconsistencies in formatting and documentation, let's make these operations optional, with default fallbacks as above. Implementations must now provide a preprocessor symbol. The instrumented atomics are updated accordingly. Both x86 and m68k have custom implementations, which are left as-is, given preprocessor symbols to avoid being overridden. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-16-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Make atomic64_fetch_add_unless() optionalMark Rutland
Architectures with atomic64_fetch_add_unless() provide a preprocessor symbol if they do so, and all other architectures have trivial C implementations of atomic64_add_unless() which are near-identical. Let's unify the trivial definitions of atomic64_fetch_add_unless() in <linux/atomic.h>, so that we always have both atomic64_fetch_add_unless() and atomic64_add_unless() with less boilerplate code. This means that atomic64_add_unless() is always implemented in core code, and the instrumented atomics are updated accordingly. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-15-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Make atomic_fetch_add_unless() optionalMark Rutland
Several architectures these have a near-identical implementation based on atomic_read() and atomic_cmpxchg() which we can instead define in <linux/atomic.h>, so let's do so, using something close to the existing x86 implementation with try_cmpxchg(). Where an architecture provides its own atomic_fetch_add_unless(), it must define a preprocessor symbol for it. The instrumented atomics are updated accordingly. Note that arch/arc's existing atomic_fetch_add_unless() had redundant barriers, as these are already present in its atomic_cmpxchg() implementation. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Link: https://lore.kernel.org/lkml/20180621121321.4761-7-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Make atomic64_inc_not_zero() optionalMark Rutland
We define a trivial fallback for atomic_inc_not_zero(), but don't do the same for atomic64_inc_not_zero(), leading most architectures to define the same boilerplate. Let's add a fallback in <linux/atomic.h>, and remove the redundant implementations. Note that atomic64_add_unless() is always defined in <linux/atomic.h>, and promotes its arguments to the requisite types, so we need not do this explicitly. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-6-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21atomics/treewide: Rename __atomic_add_unless() => atomic_fetch_add_unless()Mark Rutland
While __atomic_add_unless() was originally intended as a building-block for atomic_add_unless(), it's now used in a number of places around the kernel. It's the only common atomic operation named __atomic*(), rather than atomic_*(), and for consistency it would be better named atomic_fetch_add_unless(). This lack of consistency is slightly confusing, and gets in the way of scripting atomics. Given that, let's clean things up and promote it to an official part of the atomics API, in the form of atomic_fetch_add_unless(). This patch converts definitions and invocations over to the new name, including the instrumented version, using the following script: ---- git grep -w __atomic_add_unless | while read line; do sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}"; done git grep -w __arch_atomic_add_unless | while read line; do sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}"; done ---- Note that we do not have atomic{64,_long}_fetch_add_unless(), which will be introduced by later patches. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Palmer Dabbelt <palmer@sifive.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/apic: Ignore secondary threads if nosmt=forceThomas Gleixner
nosmt on the kernel command line merely prevents the onlining of the secondary SMT siblings. nosmt=force makes the APIC detection code ignore the secondary SMT siblings completely, so they even do not show up as possible CPUs. That reduces the amount of memory allocations for per cpu variables and saves other resources from being allocated too large. This is not fully equivalent to disabling SMT in the BIOS because the low level SMT enabling in the BIOS can result in partitioning of resources between the siblings, which is not undone by just ignoring them. Some CPUs can use the full resources when their sibling is not onlined, but this is depending on the CPU family and model and it's not well documented whether this applies to all partitioned resources. That means depending on the workload disabling SMT in the BIOS might result in better performance. Linus analysis of the Intel manual: The intel optimization manual is not very clear on what the partitioning rules are. I find: "In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers include µop queues after the execution trace cache, the queues after the register rename stage, the reorder buffer which stages instructions for retirement, and the load and store buffers. In the case of load and store buffers, partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations" but some of that partitioning may be relaxed if the HT thread is "not active": "In Intel microarchitecture code name Sandy Bridge, the micro-op queue is statically partitioned to provide 28 entries for each logical processor, irrespective of software executing in single thread or multiple threads. If one logical processor is not active in Intel microarchitecture code name Ivy Bridge, then a single thread executing on that processor core can use the 56 entries in the micro-op queue" but I do not know what "not active" means, and how dynamic it is. Some of that partitioning may be entirely static and depend on the early BIOS disabling of HT, and even if we park the cores, the resources will just be wasted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpu/AMD: Evaluate smp_num_siblings earlyThomas Gleixner
To support force disabling of SMT it's required to know the number of thread siblings early. amd_get_topology() cannot be called before the APIC driver is selected, so split out the part which initializes smp_num_siblings and invoke it from amd_early_init(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP infoBorislav Petkov
Old code used to check whether CPUID ext max level is >= 0x80000008 because that last leaf contains the number of cores of the physical CPU. The three functions called there now do not depend on that leaf anymore so the check can go. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpu/intel: Evaluate smp_num_siblings earlyThomas Gleixner
Make use of the new early detection function to initialize smp_num_siblings on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's required for force disabling SMT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpu/topology: Provide detect_extended_topology_early()Thomas Gleixner
To support force disabling of SMT it's required to know the number of thread siblings early. detect_extended_topology() cannot be called before the APIC driver is selected, so split out the part which initializes smp_num_siblings. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpu/common: Provide detect_ht_early()Thomas Gleixner
To support force disabling of SMT it's required to know the number of thread siblings early. detect_ht() cannot be called before the APIC driver is selected, so split out the part which initializes smp_num_siblings. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpu/AMD: Remove the pointless detect_ht() callThomas Gleixner
Real 32bit AMD CPUs do not have SMT and the only value of the call was to reach the magic printout which got removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/cpu: Remove the pointless CPU printoutThomas Gleixner
The value of this printout is dubious at best and there is no point in having it in two different places along with convoluted ways to reach it. Remove it completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21cpu/hotplug: Provide knobs to control SMTThomas Gleixner
Provide a command line and a sysfs knob to control SMT. The command line options are: 'nosmt': Enumerate secondary threads, but do not online them 'nosmt=force': Ignore secondary threads completely during enumeration via MP table and ACPI/MADT. The sysfs control file has the following states (read/write): 'on': SMT is enabled. Secondary threads can be freely onlined 'off': SMT is disabled. Secondary threads, even if enumerated cannot be onlined 'forceoff': SMT is permanentely disabled. Writes to the control file are rejected. 'notsupported': SMT is not supported by the CPU The command line option 'nosmt' sets the sysfs control to 'off'. This can be changed to 'on' to reenable SMT during runtime. The command line option 'nosmt=force' sets the sysfs control to 'forceoff'. This cannot be changed during runtime. When SMT is 'on' and the control file is changed to 'off' then all online secondary threads are offlined and attempts to online a secondary thread later on are rejected. When SMT is 'off' and the control file is changed to 'on' then secondary threads can be onlined again. The 'off' -> 'on' transition does not automatically online the secondary threads. When the control file is set to 'forceoff', the behaviour is the same as setting it to 'off', but the operation is irreversible and later writes to the control file are rejected. When the control status is 'notsupported' then writes to the control file are rejected. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/topology: Provide topology_smt_supported()Thomas Gleixner
Provide information whether SMT is supoorted by the CPUs. Preparatory patch for SMT control mechanism. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/smp: Provide topology_is_primary_thread()Thomas Gleixner
If the CPU is supporting SMT then the primary thread can be found by checking the lower APIC ID bits for zero. smp_num_siblings is used to build the mask for the APIC ID bits which need to be taken into account. This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different than the initial APIC ID in CPUID. But according to AMD the lower bits have to be consistent. Intel gave a tentative confirmation as well. Preparatory patch to support disabling SMT at boot/runtime. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/pti: Don't report XenPV as vulnerableJiri Kosina
Xen PV domain kernel is not by design affected by meltdown as it's enforcing split CR3 itself. Let's not report such systems as "Vulnerable" in sysfs (we're also already forcing PTI to off in X86_HYPER_XEN_PV cases); the security of the system ultimately depends on presence of mitigation in the Hypervisor, which can't be easily detected from DomU; let's report that. Reported-and-tested-by: Mike Latimer <mlatimer@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1806180959080.6203@cbobk.fhfr.pm [ Merge the user-visible string into a single line. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/build: Remove unnecessary preparation for purgatoryMasahiro Yamada
kexec-purgatory.c is properly generated when Kbuild descend into the arch/x86/purgatory/. Thus the 'archprepare' target is redundant. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/1529401422-28838-3-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21Revert "kexec/purgatory: Add clean-up for purgatory directory"Masahiro Yamada
Reverts the following commit: b0108f9e93d0 ("kexec: purgatory: add clean-up for purgatory directory") ... which incorrectly stated that the kexec-purgatory.c and purgatory.ro files were not removed after 'make mrproper'. In fact, they are. You can confirm it after reverting it. $ make mrproper $ touch arch/x86/purgatory/kexec-purgatory.c $ touch arch/x86/purgatory/purgatory.ro $ make mrproper CLEAN arch/x86/purgatory $ ls arch/x86/purgatory/ entry64.S Makefile purgatory.c setup-x86_64.S stack.S string.c This is obvious from the build system point of view. arch/x86/Makefile adds 'arch/x86' to core-y. Hence 'make clean' descends like this: arch/x86/Kbuild -> arch/x86/purgatory/Makefile Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/1529401422-28838-2-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21kprobes/x86: Do not disable preempt on int3 pathMasami Hiramatsu
Since int3 and debug exception(for singlestep) are run with IRQ disabled and while running single stepping we drop IF from regs->flags, that path must not be preemptible. So we can remove the preempt disable/enable calls from that path. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Link: https://lore.kernel.org/lkml/152942497779.15209.2879580696589868291.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21bpf/error-inject/kprobes: Clear current_kprobe and enable preempt in kprobeMasami Hiramatsu
Clear current_kprobe and enable preemption in kprobe even if pre_handler returns !0. This simplifies function override using kprobes. Jprobe used to require to keep the preemption disabled and keep current_kprobe until it returned to original function entry. For this reason kprobe_int3_handler() and similar arch dependent kprobe handers checks pre_handler result and exit without enabling preemption if the result is !0. After removing the jprobe, Kprobes does not need to keep preempt disabled even if user handler returns !0 anymore. But since the function override handler in error-inject and bpf is also returns !0 if it overrides a function, to balancing the preempt count, it enables preemption and reset current kprobe by itself. That is a bad design that is very buggy. This fixes such unbalanced preempt-count and current_kprobes setting in kprobes, bpf and error-inject. Note: for powerpc and x86, this removes all preempt_disable from kprobe_ftrace_handler because ftrace callbacks are called under preempt disabled. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <jhogan@kernel.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Link: https://lore.kernel.org/lkml/152942494574.15209.12323837825873032258.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21kprobes/x86: Don't call the ->break_handler() in x86 kprobesMasami Hiramatsu
Don't call the ->break_handler() and remove break_handler related code from x86 since that was only used by jprobe which got removed. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-arch@vger.kernel.org Link: https://lore.kernel.org/lkml/152942465549.15209.15889693025972771135.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21kprobes/x86: Remove jprobe implementationMasami Hiramatsu
Remove arch dependent setjump/longjump functions and unused fields in kprobe_ctlblk for jprobes from arch/x86. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-arch@vger.kernel.org Link: https://lore.kernel.org/lkml/152942433578.15209.14034551799624757792.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/xen: Add call of speculative_store_bypass_ht_init() to PV pathsJuergen Gross
Commit: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD") ... added speculative_store_bypass_ht_init() to the per-CPU initialization sequence. speculative_store_bypass_ht_init() needs to be called on each CPU for PV guests, too. Reported-by: Brian Woods <brian.woods@amd.com> Tested-by: Brian Woods <brian.woods@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: xen-devel@lists.xenproject.org Fixes: 1f50ddb4f4189243c05926b842dc1a0332195f31 ("x86/speculation: Handle HT correctly on AMD") Link: https://lore.kernel.org/lkml/20180621084331.21228-1-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21x86/bugs: Move the l1tf function and define pr_fmt properlyKonrad Rzeszutek Wilk
The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt which was defined as "Spectre V2 : ". Move the function to be past SSBD and also define the pr_fmt. Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-06-20x86/speculation/l1tf: Limit swap file size to MAX_PA/2Andi Kleen
For the L1TF workaround its necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset inverted never point to valid memory. Add a mechanism for the architecture to override the swap file size check in swapfile.c and add a x86 specific max swapfile check function that enforces that limit. The check is only enabled if the CPU is vulnerable to L1TF. In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system with 46bit PA it is 32TB. The limit is only per individual swap file, so it's always possible to exceed these limits with multiple swap files or partitions. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-20x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappingsAndi Kleen
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page table entry. This sets the high bits in the CPU's address space, thus making sure to point to not point an unmapped entry to valid cached memory. Some server system BIOSes put the MMIO mappings high up in the physical address space. If such an high mapping was mapped to unprivileged users they could attack low memory by setting such a mapping to PROT_NONE. This could happen through a special device driver which is not access protected. Normal /dev/mem is of course access protected. To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings. Valid page mappings are allowed because the system is then unsafe anyways. It's not expected that users commonly use PROT_NONE on MMIO. But to minimize any impact this is only enforced if the mapping actually refers to a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip the check for root. For mmaps this is straight forward and can be handled in vm_insert_pfn and in remap_pfn_range(). For mprotect it's a bit trickier. At the point where the actual PTEs are accessed a lot of state has been changed and it would be difficult to undo on an error. Since this is a uncommon case use a separate early page talk walk pass for MMIO PROT_NONE mappings that checks for this condition early. For non MMIO and non PROT_NONE there are no changes. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-20x86/speculation/l1tf: Add sysfs reporting for l1tfAndi Kleen
L1TF core kernel workarounds are cheap and normally always enabled, However they still should be reported in sysfs if the system is vulnerable or mitigated. Add the necessary CPU feature/bug bits. - Extend the existing checks for Meltdowns to determine if the system is vulnerable. All CPUs which are not vulnerable to Meltdown are also not vulnerable to L1TF - Check for 32bit non PAE and emit a warning as there is no practical way for mitigation due to the limited physical address bits - If the system has more than MAX_PA/2 physical memory the invert page workarounds don't protect the system against the L1TF attack anymore, because an inverted physical address will also point to valid memory. Print a warning in this case and report that the system is vulnerable. Add a function which returns the PFN limit for the L1TF mitigation, which will be used in follow up patches for sanity and range checks. [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-20x86/speculation/l1tf: Make sure the first page is always reservedAndi Kleen
The L1TF workaround doesn't make any attempt to mitigate speculate accesses to the first physical page for zeroed PTEs. Normally it only contains some data from the early real mode BIOS. It's not entirely clear that the first page is reserved in all configurations, so add an extra reservation call to make sure it is really reserved. In most configurations (e.g. with the standard reservations) it's likely a nop. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-20x86/speculation/l1tf: Protect PROT_NONE PTEs against speculationAndi Kleen
When PTEs are set to PROT_NONE the kernel just clears the Present bit and preserves the PFN, which creates attack surface for L1TF speculation speculation attacks. This is important inside guests, because L1TF speculation bypasses physical page remapping. While the host has its own migitations preventing leaking data from other VMs into the guest, this would still risk leaking the wrong page inside the current guest. This uses the same technique as Linus' swap entry patch: while an entry is is in PROTNONE state invert the complete PFN part part of it. This ensures that the the highest bit will point to non existing memory. The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and pte/pmd/pud_pfn undo it. This assume that no code path touches the PFN part of a PTE directly without using these primitives. This doesn't handle the case that MMIO is on the top of the CPU physical memory. If such an MMIO region was exposed by an unpriviledged driver for mmap it would be possible to attack some real memory. However this situation is all rather unlikely. For 32bit non PAE the inversion is not done because there are really not enough bits to protect anything. Q: Why does the guest need to be protected when the HyperVisor already has L1TF mitigations? A: Here's an example: Physical pages 1 2 get mapped into a guest as GPA 1 -> PA 2 GPA 2 -> PA 1 through EPT. The L1TF speculation ignores the EPT remapping. Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and they belong to different users and should be isolated. A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and gets read access to the underlying physical page. Which in this case points to PA 2, so it can read process B's data, if it happened to be in L1, so isolation inside the guest is broken. There's nothing the hypervisor can do about this. This mitigation has to be done in the guest itself. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dave Hansen <dave.hansen@intel.com>