summaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2016-07-14kvm: mmu: don't set the present bit unconditionallyBandan Das
To support execute only mappings on behalf of L1 hypervisors, we need to teach set_spte() to honor all three of L1's XWR bits. As a start, add a new variable "shadow_present_mask" that will be set for non-EPT shadow paging and clear for EPT. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-13x86/mm: Ignore A/D bits in pte/pmd/pud_none()Dave Hansen
The erratum we are fixing here can lead to stray setting of the A and D bits. That means that a pte that we cleared might suddenly have A/D set. So, stop considering those bits when determining if a pte is pte_none(). The same goes for the other pmd_none() and pud_none(). pgd_none() can be skipped because it is not affected; we do not use PGD entries for anything other than pagetables on affected configurations. This adds a tiny amount of overhead to all pte_none() checks. I doubt we'll be able to measure it anywhere. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: dave.hansen@intel.com Cc: linux-mm@kvack.org Cc: mhocko@suse.com Link: http://lkml.kernel.org/r/20160708001912.5216F89C@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13x86/mm: Move swap offset/type up in PTE to work around erratumDave Hansen
This erratum can result in Accessed/Dirty getting set by the hardware when we do not expect them to be (on !Present PTEs). Instead of trying to fix them up after this happens, we just allow the bits to get set and try to ignore them. We do this by shifting the layout of the bits we use for swap offset/type in our 64-bit PTEs. It looks like this: bitnrs: | ... | 11| 10| 9|8|7|6|5| 4| 3|2|1|0| names: | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| before: | OFFSET (9-63) |0|X|X| TYPE(1-5) |0| after: | OFFSET (14-63) | TYPE (9-13) |0|X|X|X| X| X|X|X|0| Note that D was already a don't care (X) even before. We just move TYPE up and turn its old spot (which could be hit by the A bit) into all don't cares. We take 5 bits away from the offset, but that still leaves us with 50 bits which lets us index into a 62-bit swapfile (4 EiB). I think that's probably fine for the moment. We could theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it doesn't gain us anything. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: dave.hansen@intel.com Cc: linux-mm@kvack.org Cc: mhocko@suse.com Link: http://lkml.kernel.org/r/20160708001911.9A3FD2B6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13x86/sfi: Enable enumeration of SD devicesAndy Shevchenko
SFI specification v0.8.2 defines type of devices which are connected to SD bus. In particularly WiFi dongle is a such. Add a callback to enumerate the devices connected to SD bus. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1468322192-62080-1-git-send-email-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-12pmem: kill __pmem address spaceDan Williams
The __pmem address space was meant to annotate codepaths that touch persistent memory and need to coordinate a call to wmb_pmem(). Now that wmb_pmem() is gone, there is little need to keep this annotation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12pmem: kill wmb_pmem()Dan Williams
All users have been replaced with flushing in the pmem driver. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUIDLen Brown
Skylake CPU base-frequency and TSC frequency may differ by up to 2%. Enumerate CPU and TSC frequencies separately, allowing cpu_khz and tsc_khz to differ. The existing CPU frequency calibration mechanism is unchanged. However, CPUID extensions are preferred, when available. CPUID.0x16 is preferred over MSR and timer calibration for CPU frequency discovery. CPUID.0x15 takes precedence over CPU-frequency for TSC frequency discovery. Signed-off-by: Len Brown <len.brown@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/b27ec289fd005833b27d694d9c2dbb716c5cdff7.1466138954.git.len.brown@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-11x86/tsc_msr: Remove irqoff around MSR-based TSC enumerationLen Brown
Remove the irqoff/irqon around MSR-based TSC enumeration, as it is not necessary. Also rename: try_msr_calibrate_tsc() to cpu_khz_from_msr(), as that better describes what the routine does. Signed-off-by: Len Brown <len.brown@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/a6b5c3ecd3b068175d2309599ab28163fc34215e.1466138954.git.len.brown@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-11x86/fpu/xstate: Fix fpstate_init() for XRSTORSYu-cheng Yu
In XSAVES mode if fpstate_init() is used to initialize a task's extended state area, xsave.header.xcomp_bv[63] must be set. Otherwise, when the task is scheduled, a warning is triggered from copy_kernel_to_xregs(). One such test case is: setting an invalid extended state through PTRACE. When xstateregs_set() rejects the syscall and re-initializes the task's extended state area. This triggers the warning mentioned above. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <h.peter.anvin@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi V Shankar <ravi.v.shankar@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1468253937-40008-4-git-send-email-fenghua.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10x86/platform/intel-mid: Make vertical indentation consistentAndy Shevchenko
The vertical indentation is kinda chaotic in intel-mid.h. Let's be consistent with it. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1465992260-29897-1-git-send-email-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10x86/fpu/xstate: Fix PTRACE frames for XSAVESYu-cheng Yu
XSAVES uses compacted format and is a kernel instruction. The kernel should use standard-format, non-supervisor state data for PTRACE. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> [ Edited away artificial linebreaks. ] Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/de3d80949001305fe389799973b675cab055c457.1466179491.git.yu-cheng.yu@intel.com [ Made various readability edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10x86/fpu/xstate: Fix supervisor xstate component offsetYu-cheng Yu
CPUID function 0x0d, sub function (i, i > 1) returns in ebx the offset of xstate component i. Zero is returned for a supervisor state. A supervisor state can only be saved by XSAVES and XSAVES uses a compacted format. There is no fixed offset for a supervisor state. This patch checks and makes sure a supervisor state offset is not recorded or mis-used. This has no effect in practice as we currently use no supervisor states, but it would be good to fix. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/81b29e40d35d4cec9f2511a856fe769f34935a3f.1466179491.git.yu-cheng.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10Merge branch 'linus' into x86/fpu, to pick up fixes before applying new changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09x86/cpu: Fix duplicated X86_BUG(9) macroDave Hansen
cpufeatures.h currently defines X86_BUG(9) twice on 32-bit: #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the base */ ... #ifdef CONFIG_X86_32 #define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */ #endif I think what happened was that this added the X86_BUG_ESPFIX, but in an #ifdef below most of the bugs: 58a5aac53313 x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled Then this came along and added X86_BUG_NULL_SEG, but collided with the earlier one that did the bug below the main block defining all the X86_BUG()s. 7a5d67048745 x86/cpu: Probe the behavior of nulling out a segment at boot time Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20160618001503.CEE1B141@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09Merge branch 'linus' into x86/asm, to pick up fixes before merging new changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08x86/mm: Enable KASLR for vmalloc memory regionsThomas Garnier
Add vmalloc to the list of randomized memory regions. The vmalloc memory region contains the allocation made through the vmalloc() API. The allocations are done sequentially to prevent fragmentation and each allocation address can easily be deduced especially from boot. Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: kernel-hardening@lists.openwall.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1466556426-32664-8-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08x86/mm: Enable KASLR for physical mapping memory regionsThomas Garnier
Add the physical mapping in the list of randomized memory regions. The physical memory mapping holds most allocations from boot and heap allocators. Knowing the base address and physical memory size, an attacker can deduce the PDE virtual address for the vDSO memory page. This attack was demonstrated at CanSecWest 2016, in the following presentation: "Getting Physical: Extreme Abuse of Intel Based Paged Systems": https://github.com/n3k/CansecWest2016_Getting_Physical_Extreme_Abuse_of_Intel_Based_Paging_Systems/blob/master/Presentation/CanSec2016_Presentation.pdf (See second part of the presentation). The exploits used against Linux worked successfully against 4.6+ but fail with KASLR memory enabled: https://github.com/n3k/CansecWest2016_Getting_Physical_Extreme_Abuse_of_Intel_Based_Paging_Systems/tree/master/Demos/Linux/exploits Similar research was done at Google leading to this patch proposal. Variants exists to overwrite /proc or /sys objects ACLs leading to elevation of privileges. These variants were tested against 4.6+. The page offset used by the compressed kernel retains the static value since it is not yet randomized during this boot stage. Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: kernel-hardening@lists.openwall.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1466556426-32664-7-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08x86/mm: Implement ASLR for kernel memory regionsThomas Garnier
Randomizes the virtual address space of kernel memory regions for x86_64. This first patch adds the infrastructure and does not randomize any region. The following patches will randomize the physical memory mapping, vmalloc and vmemmap regions. This security feature mitigates exploits relying on predictable kernel addresses. These addresses can be used to disclose the kernel modules base addresses or corrupt specific structures to elevate privileges bypassing the current implementation of KASLR. This feature can be enabled with the CONFIG_RANDOMIZE_MEMORY option. The order of each memory region is not changed. The feature looks at the available space for the regions based on different configuration options and randomizes the base and space between each. The size of the physical memory mapping is the available physical memory. No performance impact was detected while testing the feature. Entropy is generated using the KASLR early boot functions now shared in the lib directory (originally written by Kees Cook). Randomization is done on PGD & PUD page table levels to increase possible addresses. The physical memory mapping code was adapted to support PUD level virtual addresses. This implementation on the best configuration provides 30,000 possible virtual addresses in average for each memory region. An additional low memory page is used to ensure each CPU can start with a PGD aligned virtual address (for realmode). x86/dump_pagetable was updated to correctly display each region. Updated documentation on x86_64 memory layout accordingly. Performance data, after all patches in the series: Kernbench shows almost no difference (-+ less than 1%): Before: Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.63 (1.2695) User Time 1034.89 (1.18115) System Time 87.056 (0.456416) Percent CPU 1092.9 (13.892) Context Switches 199805 (3455.33) Sleeps 97907.8 (900.636) After: Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.489 (1.10636) User Time 1034.86 (1.36053) System Time 87.764 (0.49345) Percent CPU 1095 (12.7715) Context Switches 199036 (4298.1) Sleeps 97681.6 (1031.11) Hackbench shows 0% difference on average (hackbench 90 repeated 10 times): attemp,before,after 1,0.076,0.069 2,0.072,0.069 3,0.066,0.066 4,0.066,0.068 5,0.066,0.067 6,0.066,0.069 7,0.067,0.066 8,0.063,0.067 9,0.067,0.065 10,0.068,0.071 average,0.0677,0.0677 Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: kernel-hardening@lists.openwall.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1466556426-32664-6-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08x86/mm: Separate variable for trampoline PGDThomas Garnier
Use a separate global variable to define the trampoline PGD used to start other processors. This change will allow KALSR memory randomization to change the trampoline PGD to be correctly aligned with physical memory. Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: kernel-hardening@lists.openwall.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1466556426-32664-5-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08x86/mm: Refactor KASLR entropy functionsThomas Garnier
Move the KASLR entropy functions into arch/x86/lib to be used in early kernel boot for KASLR memory randomization. Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Alexander Popov <alpopov@ptsecurity.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: kernel-hardening@lists.openwall.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1466556426-32664-2-git-send-email-keescook@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08Merge tag 'v4.7-rc6' into x86/mm, to merge fixes before applying new changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08x86/dumpstack: Add show_stack_regs() and use itBorislav Petkov
Add a helper to dump supplied pt_regs and use it in the MSR exception handling code to have precise stack traces pointing to the actual function causing the MSR access exception and not the stack frame of the exception handler itself. The new output looks like this: unchecked MSR access error: RDMSR from 0xdeadbeef at rIP: 0xffffffff8102ddb6 (early_init_intel+0x16/0x3a0) 00000000756e6547 ffffffff81c03f68 ffffffff81dd0940 ffffffff81c03f10 ffffffff81d42e65 0000000001000000 ffffffff81c03f58 ffffffff81d3e5a3 0000800000000000 ffffffff81800080 ffffffffffffffff 0000000000000000 Call Trace: [<ffffffff81d42e65>] early_cpu_init+0xe7/0x136 [<ffffffff81d3e5a3>] setup_arch+0xa5/0x9df [<ffffffff81d38bb9>] start_kernel+0x9f/0x43a [<ffffffff81d38294>] x86_64_start_reservations+0x2f/0x31 [<ffffffff81d383fe>] x86_64_start_kernel+0x168/0x176 Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1467671487-10344-4-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08Merge tag 'v4.7-rc6' into x86/microcode, to pick up fixes before merging new ↵Ingo Molnar
changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07kbuild, x86: Track generated headers with generated-yJames Hogan
Track generated header files which aren't already in genhdr-y, alongside generic-y wrappers in the */include/generated/[uapi/]asm/ directories. Currently only x86 generates extra headers in these directories, for the purposes of enumerating system calls for different ABIs, and xen hypercalls. This will allow the asm-generic wrapper handling code to remove stale wrappers when files are removed from generic-y, without also removing these headers which are generated separately. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: James Hogan <james.hogan@imgtec.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-kbuild@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: Michal Marek <mmarek@suse.com> Link: http://lkml.kernel.org/r/1466808144-23209-2-git-send-email-james.hogan@imgtec.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-07x86: remove duplicate turbo ratio limit MSRsSrinivas Pandruvada
Remove MSR_NHM_TURBO_RATIO_LIMIT and MSR_IVT_TURBO_RATIO_LIMIT as they are duplicate. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-07Merge branch 'pm-cpufreq' into pm-cpuRafael J. Wysocki
2016-07-07Merge branch 'perf/urgent' into perf/core, to pick up fixes before merging ↵Ingo Molnar
new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-04Merge back earlier cpufreq material for v4.8.Rafael J. Wysocki
2016-07-01x86/cpu: Rename "WESTMERE2" family to "NEHALEM_G"Dave Hansen
Len Brown noticed something was amiss in our INTEL_FAM6_* definitions. It seems like model 0x1F was a Nehalem part, marketed as "Intel Core i7 and i5 Processors" (according to the SDM). But, although it was a Nehalem 0x1F had some uncore events which were shared with Westmere. Len also mentioned he thought it was called "Havendale", which Wikipedia says was graphics-oriented and canceled: https://en.wikipedia.org/wiki/Nehalem_(microarchitecture) So either way, it's probably not imporant what we call it, but call it Nehalem to be accurate, and add a "G" since it seems graphics-related. If it were canceled that would be a good reason why it's so sparsely and inconsistently referred to in the code. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160629192737.949C41A8@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-30x86/cpufeature: Add helper macro for mask check macrosDave Hansen
Every time we add a word to our cpu features, we need to add something like this in two places: (((bit)>>5)==16 && (1UL<<((bit)&31) & REQUIRED_MASK16)) The trick is getting the "16" in this case in both places. I've now screwed this up twice, so as pennance, I've come up with this patch to keep me and other poor souls from doing the same. I also commented the logic behind the bit manipulation showcased above. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160629200110.1BA8949E@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-30x86/cpufeature: Make sure DISABLED/REQUIRED macros are updatedDave Hansen
x86 has two macros which allow us to evaluate some CPUID-based features at compile time: REQUIRED_MASK_BIT_SET() DISABLED_MASK_BIT_SET() They're both defined by having the compiler check the bit argument against some constant masks of features. But, when adding new CPUID leaves, we need to check new words for these macros. So make sure that those macros and the REQUIRED_MASK* and DISABLED_MASK* get updated when necessary. This looks kinda silly to have an open-coded value ("18" in this case) open-coded in 5 places in the code. But, we really do need 5 places updated when NCAPINTS gets bumped, so now we just force the issue. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160629200108.92466F6F@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-30x86/cpufeature: Update cpufeaure macrosDave Hansen
We had a new CPUID "NCAPINT" word added, but the REQUIRED_MASK and DISABLED_MASK macros did not get updated. Update them. None of the features was needed in these masks, so there was no harm, but we should keep them updated anyway. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160629200107.8D3C9A31@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27pvclock: Cleanup to remove function pvclock_get_nsec_offsetMinfei Huang
Function __pvclock_read_cycles is short enough, so there is no need to have another function pvclock_get_nsec_offset to calculate tsc delta. It's better to combine it into function __pvclock_read_cycles. Remove useless variables in function __pvclock_read_cycles. Signed-off-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27pvclock: Add CPU barriers to get correct version valueMinfei Huang
Protocol for the "version" fields is: hypervisor raises it (making it uneven) before it starts updating the fields and raises it again (making it even) when it is done. Thus the guest can make sure the time values it got are consistent by checking the version before and after reading them. Add CPU barries after getting version value just like what function vread_pvclock does, because all of callees in this function is inline. Fixes: 502dfeff239e8313bfbe906ca0a1a6827ac8481b Cc: stable@vger.kernel.org Signed-off-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27efi: Convert efi_call_virt() to efi_call_virt_pointer()Alex Thorlton
This commit makes a few slight modifications to the efi_call_virt() macro to get it to work with function pointers that are stored in locations other than efi.systab->runtime, and renames the macro to efi_call_virt_pointer(). The majority of the changes here are to pull these macros up into header files so that they can be accessed from outside of drivers/firmware/efi/runtime-wrappers.c. The most significant change not directly related to the code move is to add an extra "p" argument into the appropriate efi_call macros, and use that new argument in place of the, formerly hard-coded, efi.systab->runtime pointer. The last piece of the puzzle was to add an efi_call_virt() macro back into drivers/firmware/efi/runtime-wrappers.c to wrap around the new efi_call_virt_pointer() macro - this was mainly to keep the code from looking too cluttered by adding a bunch of extra references to efi.systab->runtime everywhere. Note that I also broke up the code in the efi_call_virt_pointer() macro a bit in the process of moving it. Signed-off-by: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roy Franz <roy.franz@linaro.org> Cc: Russ Anderson <rja@sgi.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1466839230-12781-5-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27Merge tag 'v4.7-rc5' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "Two weeks worth of fixes here" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64 autofs: don't get stuck in a loop if vfs_write() returns an error mm/page_owner: avoid null pointer dereference tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences" fs/nilfs2: fix potential underflow in call to crc32_le oom, suspend: fix oom_reaper vs. oom_killer_disable race ocfs2: disable BUG assertions in reading blocks mm, compaction: abort free scanner if split fails mm: prevent KASAN false positives in kmemleak mm/hugetlb: clear compound_mapcount when freeing gigantic pages mm/swap.c: flush lru pvecs on compound page arrival memcg: css_alloc should return an ERR_PTR value on error memcg: mem_cgroup_migrate() may be called with irq disabled hugetlb: fix nr_pmds accounting with shared page tables Revert "mm: disable fault around on emulated access bit architecture" Revert "mm: make faultaround produce old ptes" mailmap: add Boris Brezillon's email mailmap: add Antoine Tenart's email mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask mm: mempool: kasan: don't poot mempool objects in quarantine ...
2016-06-24tree wide: get rid of __GFP_REPEAT for order-0 allocations part IMichal Hocko
This is the third version of the patchset previously sent [1]. I have basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get rid of superfluous gfp flags" which went through dm tree. I am sending it now because it is tree wide and chances for conflicts are reduced considerably when we want to target rc2. I plan to send the next step and rename the flag and move to a better semantic later during this release cycle so we will have a new semantic ready for 4.8 merge window hopefully. Motivation: While working on something unrelated I've checked the current usage of __GFP_REPEAT in the tree. It seems that a majority of the usage is and always has been bogus because __GFP_REPEAT has always been about costly high order allocations while we are using it for order-0 or very small orders very often. It seems that a big pile of them is just a copy&paste when a code has been adopted from one arch to another. I think it makes some sense to get rid of them because they are just making the semantic more unclear. Please note that GFP_REPEAT is documented as * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt * _might_ fail. This depends upon the particular VM implementation. while !costly requests have basically nofail semantic. So one could reasonably expect that order-0 request with __GFP_REPEAT will not loop for ever. This is not implemented right now though. I would like to move on with __GFP_REPEAT and define a better semantic for it. $ git grep __GFP_REPEAT origin/master | wc -l 111 $ git grep __GFP_REPEAT | wc -l 36 So we are down to the third after this patch series. The remaining places really seem to be relying on __GFP_REPEAT due to large allocation requests. This still needs some double checking which I will do later after all the simple ones are sorted out. I am touching a lot of arch specific code here and I hope I got it right but as a matter of fact I even didn't compile test for some archs as I do not have cross compiler for them. Patches should be quite trivial to review for stupid compile mistakes though. The tricky parts are usually hidden by macro definitions and thats where I would appreciate help from arch maintainers. [1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org This patch (of 19): __GFP_REPEAT has a rather weak semantic but since it has been introduced around 2.6.12 it has been ignored for low order allocations. Yet we have the full kernel tree with its usage for apparently order-0 allocations. This is really confusing because __GFP_REPEAT is explicitly documented to allow allocation failures which is a weaker semantic than the current order-0 has (basically nofail). Let's simply drop __GFP_REPEAT from those places. This would allow to identify place which really need allocator to retry harder and formulate a more specific semantic for what the flag is supposed to do actually. Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Liqin <liqin.linux@gmail.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile] Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: John Crispin <blogic@openwrt.org> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Ley Foon Tan <lftan@altera.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24x86: fix up a few misc stack pointer vs thread_info confusionsLinus Torvalds
As the actual pointer value is the same for the thread stack allocation and the thread_info, code that confused the two worked fine, but will break when the thread info is moved away from the stack allocation. It also looks very confusing. For example, the kprobe code wanted to know the current top of stack. To do that, it used this: (unsigned long)current_thread_info() + THREAD_SIZE which did indeed give the correct value. But it's not only a fairly nonsensical expression, it's also rather complex, especially since we actually have this: static inline unsigned long current_top_of_stack(void) which not only gives us the value we are interested in, but happens to be how "current_thread_info()" is currently defined as: (struct thread_info *)(current_top_of_stack() - THREAD_SIZE); so using current_thread_info() to figure out the top of the stack really is a very round-about thing to do. The other cases are just simpler confusion about task_thread_info() vs task_stack_page(), which currently return the same pointer - but if you want the stack page, you really should be using the latter one. And there was one entirely unused assignment of the current stack to a thread_info pointer. All cleaned up to make more sense today, and make it easier to move the thread_info away from the stack in the future. No semantic changes. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-23x86: avoid avoid passing around 'thread_info' in stack dumping codeLinus Torvalds
None of the code actually wants a thread_info, it all wants a task_struct, and it's just converting to a thread_info pointer much too early. No semantic change. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-23KVM: VMX: enable guest access to LMCE related MSRsAshok Raj
On Intel platforms, this patch adds LMCE to KVM MCE supported capabilities and handles guest access to LMCE related MSRs. Signed-off-by: Ashok Raj <ashok.raj@intel.com> [Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported Only enable LMCE on Intel platform Check MSR_IA32_FEATURE_CONTROL when handling guest access to MSR_IA32_MCG_EXT_CTL] Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-22ACPI / tables: move arch-specific symbol to asm/acpi.hAleksey Makarov
The constant that defines max phys address where the new upgraded ACPI table should be allocated is arch-specific. Move it to <asm/acpi.h> Signed-off-by: Aleksey Makarov <aleksey.makarov@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-06-18x86/fpu/xstate: Copy xstate registers directly to the signal frame when ↵Yu-cheng Yu
compacted format is in use XSAVES is a kernel instruction and uses a compacted format. When working with user space, the kernel should provide standard-format, non-supervisor state data. We cannot do __copy_to_user() from a compacted-format kernel xstate area to a signal frame. Dave Hansen proposes this method to simplify copy xstate directly to user. This patch is based on an earlier patch from Fenghua Yu <fenghua.yu@intel.com> Originally-from: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/c36f419d525517d04209a28dd8e1e5af9000036e.1463760376.git.yu-cheng.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-18x86/fpu/xstate: Rename 'xstate_size' to 'fpu_kernel_xstate_size', to ↵Fenghua Yu
distinguish it from 'fpu_user_xstate_size' User space uses standard format xsave area. fpstate in signal frame should have standard format size. To explicitly distinguish between xstate size in kernel space and the one in user space, we rename 'xstate_size' to 'fpu_kernel_xstate_size'. Cleanup only, no change in functionality. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> [ Rebased the patch and cleaned up the naming. ] Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/2ecbae347a5152d94be52adf7d0f3b7305d90d99.1463760376.git.yu-cheng.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-18x86/fpu/xstate: Define and use 'fpu_user_xstate_size'Fenghua Yu
The kernel xstate area can be in standard or compacted format; it is always in standard format for user mode. When XSAVES is enabled, the kernel uses the compacted format and it is necessary to use a separate fpu_user_xstate_size for signal/ptrace frames. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> [ Rebased the patch and cleaned up the naming. ] Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/8756ec34dabddfc727cda5743195eb81e8caf91c.1463760376.git.yu-cheng.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: - miscellaneous fixes for MIPS and s390 - one new kvm_stat for s390 - correctly disable VT-d posted interrupts with the rest of posted interrupts - "make randconfig" fix for x86 AMD - off-by-one in irq route check (the "good" kind that errors out a bit too early!) * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: vmx: check apicv is active before using VT-d posted interrupt kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTES kvm: svm: Do not support AVIC if not CONFIG_X86_LOCAL_APIC kvm: svm: Fix implicit declaration for __default_cpu_present_to_apicid() MIPS: KVM: Fix CACHE triggered exception emulation MIPS: KVM: Don't unwind PC when emulating CACHE MIPS: KVM: Include bit 31 in segment matches MIPS: KVM: Fix modular KVM under QEMU KVM: s390: Add stats for PEI events KVM: s390: ignore IBC if zero
2016-06-16locking/atomic: Remove linux/atomic.h:atomic_fetch_or()Peter Zijlstra
Since all architectures have this implemented now natively, remove this dead code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-16locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()Peter Zijlstra
Implement FETCH-OP atomic primitives, these are very similar to the existing OP-RETURN primitives we already have, except they return the value of the atomic variable _before_ modification. This is especially useful for irreversible operations -- such as bitops (because it becomes impossible to reconstruct the state prior to modification). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-16kvm: vmx: hook preemption timer supportYunhong Jiang
Hook the VMX preemption timer to the "hv timer" functionality added by the previous patch. This includes: checking if the feature is supported, if the feature is broken on the CPU, the hooks to setup/clean the VMX preemption timer, arming the timer on vmentry and handling the vmexit. A module parameter states if the VMX preemption timer should be utilized. Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> [Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value. Put all VMX bits here. Enable it by default #yolo. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16KVM: x86: support using the vmx preemption timer for tsc deadline timerYunhong Jiang
The VMX preemption timer can be used to virtualize the TSC deadline timer. The VMX preemption timer is armed when the vCPU is running, and a VMExit will happen if the virtual TSC deadline timer expires. When the vCPU thread is blocked because of HLT, KVM will switch to use an hrtimer, and then go back to the VMX preemption timer when the vCPU thread is unblocked. This solution avoids the complex OS's hrtimer system, and the host timer interrupt handling cost, replacing them with a little math (for guest->host TSC and host TSC->preemption timer conversion) and a cheaper VMexit. This benefits latency for isolated pCPUs. [A word about performance... Yunhong reported a 30% reduction in average latency from cyclictest. I made a similar test with tscdeadline_latency from kvm-unit-tests, and measured - ~20 clock cycles loss (out of ~3200, so less than 1% but still statistically significant) in the worst case where the test halts just after programming the TSC deadline timer - ~800 clock cycles gain (25% reduction in latency) in the best case where the test busy waits. I removed the VMX bits from Yunhong's patch, to concentrate them in the next patch - Paolo] Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>