summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2017-11-01x86/boot: Relocate definition of the initial state of CR0Ricardo Neri
Both head_32.S and head_64.S utilize the same value to initialize the control register CR0. Also, other parts of the kernel might want to access this initial definition (e.g., emulation code for User-Mode Instruction Prevention uses this state to provide a sane dummy value for CR0 when emulating the smsw instruction). Thus, relocate this definition to a header file from which it can be conveniently accessed. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: ricardo.neri@intel.com Cc: linux-mm@kvack.org Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Shuah Khan <shuah@kernel.org> Cc: linux-arch@vger.kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jiri Slaby <jslaby@suse.cz> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/1509135945-13762-3-git-send-email-ricardo.neri-calderon@linux.intel.com
2017-11-01x86/mm: Relocate page fault error codes to traps.hRicardo Neri
Up to this point, only fault.c used the definitions of the page fault error codes. Thus, it made sense to keep them within such file. Other portions of code might be interested in those definitions too. For instance, the User- Mode Instruction Prevention emulation code will use such definitions to emulate a page fault when it is unable to successfully copy the results of the emulated instructions to user space. While relocating the error code enumeration, the prefix X86_ is used to make it consistent with the rest of the definitions in traps.h. Of course, code using the enumeration had to be updated as well. No functional changes were performed. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: ricardo.neri@intel.com Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jiri Slaby <jslaby@suse.cz> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/1509135945-13762-2-git-send-email-ricardo.neri-calderon@linux.intel.com
2017-10-31xen: support 52 bit physical addresses in pv guestsJuergen Gross
Physical addresses on processors supporting 5 level paging can be up to 52 bits wide. For a Xen pv guest running on such a machine those physical addresses have to be supported in order to be able to use any memory on the machine even if the guest itself does not support 5 level paging. So when reading/writing a MFN from/to a pte don't use the kernel's PTE_PFN_MASK but a new XEN_PTE_MFN_MASK allowing full 40 bit wide MFNs. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-10-31Drivers: hv: vmbus: Make panic reporting to be more usefulK. Y. Srinivasan
Hyper-V allows the guest to report panic and the guest can pass additional information. All this is logged on the host. Currently Linux is passing back information that is not particularly useful. Make the following changes: 1. Windows uses crash MSR P0 to report bugcheck code. Follow the same convention for Linux as well. 2. It will be useful to know the gust ID of the Linux guest that has paniced. Pass back this information. These changes will help in better supporting Linux on Hyper-V Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-31x86/cpufeatures: Enable new SSE/AVX/AVX512 CPU featuresGayatri Kammela
Add a few new SSE/AVX/AVX512 instruction groups/features for enumeration in /proc/cpuinfo: AVX512_VBMI2, GFNI, VAES, VPCLMULQDQ, AVX512_VNNI, AVX512_BITALG. CPUID.(EAX=7,ECX=0):ECX[bit 6] AVX512_VBMI2 CPUID.(EAX=7,ECX=0):ECX[bit 8] GFNI CPUID.(EAX=7,ECX=0):ECX[bit 9] VAES CPUID.(EAX=7,ECX=0):ECX[bit 10] VPCLMULQDQ CPUID.(EAX=7,ECX=0):ECX[bit 11] AVX512_VNNI CPUID.(EAX=7,ECX=0):ECX[bit 12] AVX512_BITALG Detailed information of CPUID bits for these features can be found in the Intel Architecture Instruction Set Extensions and Future Features Programming Interface document (refer to Table 1-1. and Table 1-2.). A copy of this document is available at https://bugzilla.kernel.org/show_bug.cgi?id=197239 Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Yang Zhong <yang.zhong@intel.com> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/1509412829-23380-1-git-send-email-gayatri.kammela@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"Ingo Molnar
This reverts commit ce56a86e2ade45d052b3228cdfebe913a1ae7381. There's unanticipated interaction with some boot parameters like 'mem=', which now cause the new checks via valid_mmap_phys_addr_range() to be too restrictive, crashing a Qemu bootup in fact, as reported by Fengguang Wu. So while the motivation of the change is still entirely valid, we need a few more rounds of testing to get it right - it's way too late after -rc6, so revert it for now. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Craig Bergstrom <craigb@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: dsafonov@virtuozzo.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: oleg@redhat.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24locking/barriers: Convert users of lockless_dereference() to READ_ONCE()Will Deacon
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24Merge tag 'v4.14-rc6' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-23Merge branch 'x86/urgent' into x86/asm, to pick up dependent fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-20Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-20x86/entry: Use SYSCALL_DEFINE() macros for sys_modify_ldt()Dave Hansen
We do not have tracepoints for sys_modify_ldt() because we define it directly instead of using the normal SYSCALL_DEFINEx() macros. However, there is a reason sys_modify_ldt() does not use the macros: it has an 'int' return type instead of 'unsigned long'. This is a bug, but it's a bug cemented in the ABI. What does this mean? If we return -EINVAL from a function that returns 'int', we have 0x00000000ffffffea in %rax. But, if we return -EINVAL from a function returning 'unsigned long', we end up with 0xffffffffffffffea in %rax, which is wrong. To work around this and maintain the 'int' behavior while using the SYSCALL_DEFINEx() macros, so we add a cast to 'unsigned int' in both implementations of sys_modify_ldt(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Brian Gerst <brgerst@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171018172107.1A79C532@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-20x86/mm: Limit mmap() of /dev/mem to valid physical addressesCraig Bergstrom
Currently, it is possible to mmap() any offset from /dev/mem. If a program mmaps() /dev/mem offsets outside of the addressable limits of a system, the page table can be corrupted by setting reserved bits. For example if you mmap() offset 0x0001000000000000 of /dev/mem on an x86_64 system with a 48-bit bus, the page fault handler will be called with error_code set to RSVD. The kernel then crashes with a page table corruption error. This change prevents this page table corruption on x86 by refusing to mmap offsets higher than the highest valid address in the system. Signed-off-by: Craig Bergstrom <craigb@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: dsafonov@virtuozzo.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/20171019192856.39672-1-craigb@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-19dma-mapping: turn dma_cache_sync into a dma_map_ops methodChristoph Hellwig
After we removed all the dead wood it turns out only two architectures actually implement dma_cache_sync as a real op: mips and parisc. Add a cache_sync method to struct dma_map_ops and implement it for the mips defualt DMA ops, and the parisc pa11 ops. Note that arm, arc and openrisc support DMA_ATTR_NON_CONSISTENT, but never provided a functional dma_cache_sync implementations, which seems somewhat odd. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2017-10-19x86: make dma_cache_sync a no-opChristoph Hellwig
x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't make any sense to do any work in dma_cache_sync given that it must be a no-op when dma_alloc_attrs returns coherent memory. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2017-10-18KVM: SVM: detect opening of SMI window using STGI interceptLadi Prosek
Commit 05cade71cf3b ("KVM: nSVM: fix SMI injection in guest mode") made KVM mask SMI if GIF=0 but it didn't do anything to unmask it when GIF is enabled. The issue manifests for me as a significantly longer boot time of Windows guests when running with SMM-enabled OVMF. This commit fixes it by intercepting STGI instead of requesting immediate exit if the reason why SMM was masked is GIF. Fixes: 05cade71cf3b ("KVM: nSVM: fix SMI injection in guest mode") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-18x86/mm: Remove debug/x86/tlb_defer_switch_to_init_mmAndy Lutomirski
Borislav thinks that we don't need this knob in a released kernel. Get rid of it. Requested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode") Link: http://lkml.kernel.org/r/1fa72431924e81e86c164ff7881bf9240d1f1a6c.1508000261.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-18x86/mm: Tidy up "x86/mm: Flush more aggressively in lazy TLB mode"Andy Lutomirski
Due to timezones, commit: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode") was an outdated patch that well tested and fixed the bug but didn't address Borislav's review comments. Tidy it up: - The name "tlb_use_lazy_mode()" was highly confusing. Change it to "tlb_defer_switch_to_init_mm()", which describes what it actually means. - Move the static_branch crap into a helper. - Improve comments. Actually removing the debugfs option is in the next patch. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode") Link: http://lkml.kernel.org/r/154ef95428d4592596b6e98b0af1d2747d6cfbf8.1508000261.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-17x86/cpuid: Add generic table for CPUID dependenciesAndi Kleen
Some CPUID features depend on other features. Currently it's possible to to clear dependent features, but not clear the base features, which can cause various interesting problems. This patch implements a generic table to describe dependencies between CPUID features, to be used by all code that clears CPUID. Some subsystems (like XSAVE) had an own implementation of this, but it's better to do it all in a single place for everyone. Then clear_cpu_cap and setup_clear_cpu_cap always look up this table and clear all dependencies too. This is intended to be a practical table: only for features that make sense to clear. If someone for example clears FPU, or other features that are essentially part of the required base feature set, not much is going to work. Handling that is right now out of scope. We're only handling features which can be usefully cleared. Signed-off-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan McDowell <noodles@earth.li> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20171013215645.23166-3-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-17x86/vector: Use correct per cpu variable in free_moved_vector()Thomas Gleixner
free_moved_vector() accesses the per cpu vector array with this_cpu_write() to clear the vector. The function has two call sites: 1) The vector cleanup IPI 2) The force_complete_move() code path For #1 this_cpu_write() is correct as it runs on the CPU on which the vector needs to be freed. For #2 this_cpu_write() is wrong because the function is called from an outgoing CPU which is not necessarily the CPU on which the previous vector needs to be freed. As a result it sets the vector on the outgoing CPU to NULL, which is pointless as that CPU does not handle interrupts anymore. What's worse is that it leaves the vector on the previous target CPU in place which later on triggers the BUG_ON(vector) in the vector allocation code when the vector gets reused. That's possible because the bitmap allocator entry of that CPU is freed correctly. Always use the CPU to which the vector was associated and clear the vector entry on that CPU. Fixup the tracepoint as well so it tracks on which CPU the vector gets removed. Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Petri Latvala <petri.latvala@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Juergen Gross <jgross@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Yu Chen <yu.c.chen@intel.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710161614430.1973@nanos
2017-10-17x86/tsc: Make CONFIG_X86_TSC=n build work againThomas Gleixner
tsc_async_resets is only available when CONFIG_X86_TSC=y. So a build with CONFIG_X86_TSC=n breaks: arch/x86/kernel/tsc.o: In function `tsc_init': (.init.text+0x87b): undefined reference to `tsc_async_resets' Add a stub define for the TSC=n case. Side note: This config switch should simply be removed. Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: 341102c3ef29 ("x86/tsc: Add option that TSC on Socket 0 being non-zero is valid") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Travis <mike.travis@hpe.com>
2017-10-16x86/platform/UV: Add check of TSC state set by UV BIOSmike.travis@hpe.com
Insert a check early in UV system startup that checks whether BIOS was able to obtain satisfactory TSC Sync stability. If not, it usually is caused by an error in the external TSC clock generation source. In this case the best fallback is to use the builtin hardware RTC as the kernel will not be able to set an accurate TSC sync either. Signed-off-by: Mike Travis <mike.travis@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Reviewed-by: Russ Anderson <russ.anderson@hpe.com> Reviewed-by: Andrew Banman <andrew.abanman@hpe.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Bin Gao <bin.gao@linux.intel.com> Link: https://lkml.kernel.org/r/20171012163202.406294490@stormcage.americas.sgi.com
2017-10-16x86/tsc: Add option that TSC on Socket 0 being non-zero is validmike.travis@hpe.com
Add a flag to indicate and process that TSC counters are on chassis that reset at different times during system startup. Therefore which TSC ADJUST values should be zero is not predictable. Signed-off-by: Mike Travis <mike.travis@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Reviewed-by: Russ Anderson <russ.anderson@hpe.com> Reviewed-by: Andrew Banman <andrew.abanman@hpe.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Bin Gao <bin.gao@linux.intel.com> Link: https://lkml.kernel.org/r/20171012163201.944370012@stormcage.americas.sgi.com
2017-10-14Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "A landry list of fixes: - fix reboot breakage on some PCID-enabled system - fix crashes/hangs on some PCID-enabled systems - fix microcode loading on certain older CPUs - various unwinder fixes - extend an APIC quirk to more hardware systems and disable APIC related warning on virtualized systems - various Hyper-V fixes - a macro definition robustness fix - remove jprobes IRQ disabling - various mem-encryption fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode: Do the family check first x86/mm: Flush more aggressively in lazy TLB mode x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on hypervisors x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structures x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUs x86/unwind: Disable unwinder warnings on 32-bit x86/unwind: Align stack pointer in unwinder dump x86/unwind: Use MSB for frame pointer encoding on 32-bit x86/unwind: Fix dereference of untrusted pointer x86/alternatives: Fix alt_max_short macro to really be a max() x86/mm/64: Fix reboot interaction with CR4.PCIDE kprobes/x86: Remove IRQ disabling from jprobe handlers kprobes/x86: Set up frame pointer in kprobe trampoline
2017-10-14Merge branch 'ras-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixes from Ingo Molnar: "A boot parameter fix, plus a header export fix" * 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Hide mca_cfg RAS/CEC: Use the right length for "cec_disable"
2017-10-14x86/unwind: Rename unwinder config options to 'CONFIG_UNWINDER_*'Josh Poimboeuf
Rename the unwinder config options from: CONFIG_ORC_UNWINDER CONFIG_FRAME_POINTER_UNWINDER CONFIG_GUESS_UNWINDER to: CONFIG_UNWINDER_ORC CONFIG_UNWINDER_FRAME_POINTER CONFIG_UNWINDER_GUESS ... in order to give them a more logical config namespace. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/73972fc7e2762e91912c6b9584582703d6f1b8cc.1507924831.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-14Merge branch 'core/urgent' into x86/asm, to pick up dependencyIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-14x86/mm: Flush more aggressively in lazy TLB modeAndy Lutomirski
Since commit: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking") x86's lazy TLB mode has been all the way lazy: when running a kernel thread (including the idle thread), the kernel keeps using the last user mm's page tables without attempting to maintain user TLB coherence at all. From a pure semantic perspective, this is fine -- kernel threads won't attempt to access user pages, so having stale TLB entries doesn't matter. Unfortunately, I forgot about a subtlety. By skipping TLB flushes, we also allow any paging-structure caches that may exist on the CPU to become incoherent. This means that we can have a paging-structure cache entry that references a freed page table, and the CPU is within its rights to do a speculative page walk starting at the freed page table. I can imagine this causing two different problems: - A speculative page walk starting from a bogus page table could read IO addresses. I haven't seen any reports of this causing problems. - A speculative page walk that involves a bogus page table can install garbage in the TLB. Such garbage would always be at a user VA, but some AMD CPUs have logic that triggers a machine check when it notices these bogus entries. I've seen a couple reports of this. Boris further explains the failure mode: > It is actually more of an optimization which assumes that paging-structure > entries are in WB DRAM: > > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables > performance optimization that assumes PML4, PDP, PDE, and PTE entries > are in cacheable WB-DRAM; memory type checks may be bypassed, and > addresses outside of WB-DRAM may result in undefined behavior or NB > protocol errors. 1=Disables performance optimization and allows PML4, > PDP, PDE and PTE entries to be in any memory type. Operating systems > that maintain page tables in memory types other than WB- DRAM must set > TlbCacheDis to insure proper operation." > > The MCE generated is an NB protocol error to signal that > > "Link: A specific coherent-only packet from a CPU was issued to an > IO link. This may be caused by software which addresses page table > structures in a memory type other than cacheable WB-DRAM without > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for > example, when page table structure addresses are above top of memory. In > such cases, the NB will generate an MCE if it sees a mismatch between > the memory operation generated by the core and the link type." > > I'm assuming coherent-only packets don't go out on IO links, thus the > error. To fix this, reinstate TLB coherence in lazy mode. With this patch applied, we do it in one of two ways: - If we have PCID, we simply switch back to init_mm's page tables when we enter a kernel thread -- this seems to be quite cheap except for the cost of serializing the CPU. - If we don't have PCID, then we set a flag and switch to init_mm the first time we would otherwise need to flush the TLB. The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed to override the default mode for benchmarking. In theory, we could optimize this better by only flushing the TLB in lazy CPUs when a page table is freed. Doing that would require auditing the mm code to make sure that all page table freeing goes through tlb_remove_page() as well as reworking some data structures to implement the improved flush logic. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Adam Borowski <kilobyte@angband.pl> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biggers <ebiggers@google.com> Cc: Johannes Hirte <johannes.hirte@datenkhaos.de> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking") Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-13x86/fpu/debug: Remove unused 'x86_fpu_state' and 'x86_fpu_deactivate_state' ↵Steven Rostedt (VMware)
tracepoints Commit: d1898b733619 ("x86/fpu: Add tracepoints to dump FPU state at key points") ... added the 'x86_fpu_state' and 'x86_fpu_deactivate_state' trace points, but never used them. Today they are still not used. As they take up and waste memory, remove them. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171012180619.670b68b6@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-12KVM: nSVM: fix SMI injection in guest modeLadi Prosek
Entering SMM while running in guest mode wasn't working very well because several pieces of the vcpu state were left set up for nested operation. Some of the issues observed: * L1 was getting unexpected VM exits (using L1 interception controls but running in SMM execution environment) * MMU was confused (walk_mmu was still set to nested_mmu) * INTERCEPT_SMI was not emulated for L1 (KVM never injected SVM_EXIT_SMI) Intel SDM actually prescribes the logical processor to "leave VMX operation" upon entering SMM in 34.14.1 Default Treatment of SMI Delivery. AMD doesn't seem to document this but they provide fields in the SMM state-save area to stash the current state of SVM. What we need to do is basically get out of guest mode for the duration of SMM. All this completely transparent to L1, i.e. L1 is not given control and no L1 observable state changes. To avoid code duplication this commit takes advantage of the existing nested vmexit and run functionality, perhaps at the cost of efficiency. To get out of guest mode, nested_svm_vmexit is called, unchanged. Re-entering is performed using enter_svm_guest_mode. This commit fixes running Windows Server 2016 with Hyper-V enabled in a VM with OVMF firmware (OVMF_CODE-need-smm.fd). Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: x86: introduce ISA specific smi_allowed callbackLadi Prosek
Similar to NMI, there may be ISA specific reasons why an SMI cannot be injected into the guest. This commit adds a new smi_allowed callback to be implemented in following commits. Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: x86: introduce ISA specific SMM entry/exit callbacksLadi Prosek
Entering and exiting SMM may require ISA specific handling under certain circumstances. This commit adds two new callbacks with empty implementations. Actual functionality will be added in following commits. * pre_enter_smm() is to be called when injecting an SMM, before any SMM related vcpu state has been changed * pre_leave_smm() is to be called when emulating the RSM instruction, when the vcpu is in real mode and before any SMM related vcpu state has been restored Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: VMX: rename RDSEED and RDRAND vmx ctrls to reflect exitingDavid Hildenbrand
Let's just name these according to the SDM. This should make it clearer that the are used to enable exiting and not the feature itself. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12Merge branch 'irq/urgent' into x86/apicThomas Gleixner
Pick up core changes which affect the vector rework.
2017-10-10x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUsVitaly Kuznetsov
hv_flush_pcpu_ex structures are not cleared between calls for performance reasons (they're variable size up to PAGE_SIZE each) but we must clear hv_vp_set.bank_contents part of it to avoid flushing unneeded vCPUs. The rest of the structure is formed correctly. To do the clearing in an efficient way stash the maximum possible vCPU number (this may differ from Linux CPU id). Reported-by: Jork Loeser <Jork.Loeser@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171006154854.18092-1-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/arch: Remove dummy arch_{read,spin,write}_lock_flags() implementationsWill Deacon
The arch_{read,spin,write}_lock_flags() macros are simply mapped to the non-flags versions by the majority of architectures, so do this in core code and remove the dummy implementations. Also remove the implementation in spinlock_up.h, since all callers of do_raw_spin_lock_flags() call local_irq_save(flags) anyway. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/arch: Remove dummy arch_{read,spin,write}_relax() implementationsWill Deacon
arch_{read,spin,write}_relax() are defined as cpu_relax() by the core code, so architectures that can't do better (i.e. most of them) don't need to bother with the dummy definitions. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/arch, x86: Add __down_read_killable()Kirill Tkhai
Similar to __down_write_killable(), add read killable primitive: extract current __down_read() code to macros and teach it to get different functions as slow_path argument: store ax register to ret, and add sp register and preserve its value. Add call_rwsem_down_read_failed_killable() assembly entry similar to call_rwsem_down_read_failed(): push dx register to stack in additional to common registers, as it's not declarated as modifiable in ____down_read(). Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670118802.23930.1316107715255410256.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/paravirt: Use new static key for controlling call of virt_spin_lock()Juergen Gross
There are cases where a guest tries to switch spinlocks to bare metal behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this has the downside of falling back to unfair test and set scheme for qspinlocks due to virt_spin_lock() detecting the virtualized environment. Add a static key controlling whether virt_spin_lock() should be called or not. When running on bare metal set the new key to false. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: boris.ostrovsky@oracle.com Cc: chrisw@sous-sol.org Cc: hpa@zytor.com Cc: jeremy@goop.org Cc: rusty@rustcorp.com.au Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170906173625.18158-2-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10Merge branch 'locking/urgent' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-09x86/alternatives: Fix alt_max_short macro to really be a max()Mathias Krause
The alt_max_short() macro in asm/alternative.h does not work as intended, leading to nasty bugs. E.g. alt_max_short("1", "3") evaluates to 3, but alt_max_short("3", "1") evaluates to 1 -- not exactly the maximum of 1 and 3. In fact, I had to learn it the hard way by crashing my kernel in not so funny ways by attempting to make use of the ALTENATIVE_2 macro with alternatives where the first one was larger than the second one. According to [1] and commit dbe4058a6a44 ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly") the right handed side should read "-(-(a < b))" not "-(-(a - b))". Fix that, to make the macro work as intended. While at it, fix up the comments regarding the additional "-", too. It's not about gas' usage of s32 but brain dead logic of having a "true" value of -1 for the < operator ... *sigh* Btw., the one in asm/alternative-asm.h is correct. And, apparently, all current users of ALTERNATIVE_2() pass same sized alternatives, avoiding to hit the bug. [1] http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax Reviewed-and-tested-by: Borislav Petkov <bp@suse.de> Fixes: dbe4058a6a44 ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly") Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1507228213-13095-1-git-send-email-minipli@googlemail.com
2017-10-05x86/mce: Hide mca_cfgBorislav Petkov
Now that lguest is gone, put it in the internal header which should be used only by MCA/RAS code. Add missing header guards while at it. No functional change. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20171002092836.22971-3-bp@alien8.de
2017-10-04kvm/x86: Avoid async PF preempting the kernel incorrectlyBoqun Feng
Currently, in PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call schedule() to reschedule in some cases. This could result in accidentally ending the current RCU read-side critical section early, causing random memory corruption in the guest, or otherwise preempting the currently running task inside between preempt_disable and preempt_enable. The difficulty to handle this well is because we don't know whether an async PF delivered in a preemptible section or RCU read-side critical section for PREEMPT_COUNT=n, since preempt_disable()/enable() and rcu_read_lock/unlock() are both no-ops in that case. To cure this, we treat any async PF interrupting a kernel context as one that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing the schedule() path in that case. To do so, a second parameter for kvm_async_pf_task_wait() is introduced, so that we know whether it's called from a context interrupting the kernel, and the parameter is set properly in all the callsites. Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: stable@vger.kernel.org Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-03x86/boot: Spell out "boot CPU" for BPJean Delvare
It's not obvious to everybody that BP stands for boot processor. At least it was not for me. And BP is also a CPU register on x86, so it is ambiguous. Spell out "boot CPU" everywhere instead. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Alok Kataria <akataria@vmware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-01Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "This contains the following fixes and improvements: - Avoid dereferencing an unprotected VMA pointer in the fault signal generation code - Fix inline asm call constraints for GCC 4.4 - Use existing register variable to retrieve the stack pointer instead of forcing the compiler to create another indirect access which results in excessive extra 'mov %rsp, %<dst>' instructions - Disable branch profiling for the memory encryption code to prevent an early boot crash - Fix a sparse warning caused by casting the __user annotation in __get_user_asm_u64() away - Fix an off by one error in the loop termination of the error patch in the x86 sysfs init code - Add missing CPU IDs to various Intel specific drivers to enable the functionality on recent hardware - More (init) constification in the numachip code" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Use register variable to get stack pointer value x86/mm: Disable branch profiling in mem_encrypt.c x86/asm: Fix inline asm call constraints for GCC 4.4 perf/x86/intel/uncore: Correct num_boxes for IIO and IRP perf/x86/intel/rapl: Add missing CPU IDs perf/x86/msr: Add missing CPU IDs perf/x86/intel/cstate: Add missing CPU IDs x86: Don't cast away the __user in __get_user_asm_u64() x86/sysfs: Fix off-by-one error in loop termination x86/mm: Fix fault error path using unsafe vma pointer x86/numachip: Add const and __initconst to numachip2_clockevent
2017-09-29Merge tag 'for-linus-4.14c-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - avoid a warning when compiling with clang - consider read-only bits in xen-pciback when writing to a BAR - fix a boot crash of pv-domains * tag 'for-linus-4.14c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/mmu: Call xen_cleanhighmap() with 4MB aligned for page tables mapping xen-pciback: relax BAR sizing write value check x86/xen: clean up clang build warning
2017-09-29x86/asm: Use register variable to get stack pointer valueAndrey Ryabinin
Currently we use current_stack_pointer() function to get the value of the stack pointer register. Since commit: f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang") ... we have a stack register variable declared. It can be used instead of current_stack_pointer() function which allows to optimize away some excessive "mov %rsp, %<dst>" instructions: -mov %rsp,%rdx -sub %rdx,%rax -cmp $0x3fff,%rax -ja ffffffff810722fd <ist_begin_non_atomic+0x2d> +sub %rsp,%rax +cmp $0x3fff,%rax +ja ffffffff810722fa <ist_begin_non_atomic+0x2a> Remove current_stack_pointer(), rename __asm_call_sp to current_stack_pointer and use it instead of the removed function. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170929141537.29167-1-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29x86/asm: Fix inline asm call constraints for GCC 4.4Josh Poimboeuf
The kernel test bot (run by Xiaolong Ye) reported that the following commit: f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang") is causing double faults in a kernel compiled with GCC 4.4. Linus subsequently diagnosed the crash pattern and the buggy commit and found that the issue is with this code: register unsigned int __asm_call_sp asm("esp"); #define ASM_CALL_CONSTRAINT "+r" (__asm_call_sp) Even on a 64-bit kernel, it's using ESP instead of RSP. That causes GCC to produce the following bogus code: ffffffff8147461d: 89 e0 mov %esp,%eax ffffffff8147461f: 4c 89 f7 mov %r14,%rdi ffffffff81474622: 4c 89 fe mov %r15,%rsi ffffffff81474625: ba 20 00 00 00 mov $0x20,%edx ffffffff8147462a: 89 c4 mov %eax,%esp ffffffff8147462c: e8 bf 52 05 00 callq ffffffff814c98f0 <copy_user_generic_unrolled> Despite the absurdity of it backing up and restoring the stack pointer for no reason, the bug is actually the fact that it's only backing up and restoring the lower 32 bits of the stack pointer. The upper 32 bits are getting cleared out, corrupting the stack pointer. So change the '__asm_call_sp' register variable to be associated with the actual full-size stack pointer. This also requires changing the __ASM_SEL() macro to be based on the actual compiled arch size, rather than the CONFIG value, because CONFIG_X86_64 compiles some files with '-m32' (e.g., realmode and vdso). Otherwise Clang fails to build the kernel because it complains about the use of a 64-bit register (RSP) in a 32-bit file. Reported-and-Bisected-and-Tested-by: kernel test robot <xiaolong.ye@intel.com> Diagnosed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang") Link: http://lkml.kernel.org/r/20170928215826.6sdpmwtkiydiytim@treble Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-28locking/refcounts, x86/asm: Use unique .text section for refcount exceptionsKees Cook
Using .text.unlikely for refcount exceptions isn't safe because gcc may move entire functions into .text.unlikely (e.g. in6_dev_dev()), which would cause any uses of a protected refcount_t function to stay inline with the function, triggering the protection unconditionally: .section .text.unlikely,"ax",@progbits .type in6_dev_get, @function in6_dev_getx: .LFB4673: .loc 2 4128 0 .cfi_startproc ... lock; incl 480(%rbx) js 111f .pushsection .text.unlikely 111: lea 480(%rbx), %rcx 112: .byte 0x0f, 0xff .popsection 113: This creates a unique .text..refcount section and adds an additional test to the exception handler to WARN in the case of having none of OF, SF, nor ZF set so we can see things like this more easily in the future. The double dot for the section name keeps it out of the TEXT_MAIN macro namespace, to avoid collisions and so it can be put at the end with text.unlikely to keep the cold code together. See commit: cb87481ee89db ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") ... which matches C names: [a-zA-Z0-9_] but not ".". Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Elena <elena.reshetova@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch <linux-arch@vger.kernel.org> Fixes: 7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast refcount overflow protection") Link: http://lkml.kernel.org/r/1504382986-49301-2-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>