summaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)Author
2015-06-09x86/mpx: Allow 32-bit binaries on 64-bit kernels againDave Hansen
Now that the bugs in mixed mode MPX handling are fixed, re-allow 32-bit binaries on 64-bit kernels again. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183706.70277DAD@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Do not count MPX VMAs as neighbors when unmappingDave Hansen
The comment pretty much says it all. I wrote a test program that does lots of random allocations and forces bounds tables to be created. It came up with a layout like this: .... | BOUNDS DIRECTORY ENTRY COVERS | .... | BOUNDS TABLE COVERS | | BOUNDS TABLE | REAL ALLOC | BOUNDS TABLE | Unmapping "REAL ALLOC" should have been able to free the bounds table "covering" the "REAL ALLOC" because it was the last real user. But, the neighboring VMA bounds tables were found, considered as real neighbors, and we declined to free the bounds table covering the area. Doing this over and over left a small but significant number of these orphans. Handling them is fairly straighforward. All we have to do is walk the VMAs and skip all of the MPX ones when looking for neighbors. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183706.A6BD90BF@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Rewrite the unmap codeDave Hansen
The MPX code needs to clear out bounds tables for memory which is no longer in use. We do this when a userspace mapping is torn down (unmapped). There are two modes: 1. An entire bounds table becomes unused, and can be freed and its pointer removed from the bounds directory. This happens either when a large mapping is torn down, or when a small mapping is torn down and it is the last mapping "covered" by a bounds table. 2. Only part of a bounds table becomes unused, in which case we free the backing memory as if MADV_DONTNEED was called. The old code was a spaghetti mess of "edge" bounds tables where the edges were handled specially, even if we were unmapping an entire one. Non-edge bounds tables are always fully unmapped, but share a different code path from the edge ones. The old code had a bug where it was unmapping too much memory. I worked on fixing it for two days and gave up. I didn't write the original code. I didn't particularly like it, but it worked, so I left it. After my debug session, I realized it was undebuggagle *and* buggy, so out it went. I also wrote a new unmapping test program which uncovers bugs pretty nicely. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183706.DCAEC67D@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Support 32-bit binaries on 64-bit kernelsDave Hansen
Right now, the kernel can only switch between 64-bit and 32-bit binaries at compile time. This patch adds support for 32-bit binaries on 64-bit kernels when we support ia32 emulation. We essentially choose which set of table sizes to use when doing arithmetic for the bounds table calculations. This also uses a different approach for calculating the table indexes than before. I think the new one makes it much more clear what is going on, and allows us to share more code between the 32-bit and 64-bit cases. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183705.E01F21E2@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Use 32-bit-only cmpxchg() for 32-bit appsDave Hansen
user_atomic_cmpxchg_inatomic() actually looks at sizeof(*ptr) to figure out how many bytes to copy. If we run it on a 64-bit kernel with a 64-bit pointer, it will copy a 64-bit bounds directory entry. That's fine, except when we have 32-bit programs with 32-bit bounds directory entries and we only *want* 32-bits. This patch breaks the cmpxchg() operation out in to its own function and performs the 32-bit type swizzling in there. Note, the "64-bit" version of this code _would_ work on a 32-bit-only kernel. The issue this patch addresses is only for when the kernel's 'long' is mismatched from the size of the bounds directory entry of the process we are working on. The new helper modifies 'actual_old_val' or returns an error. But gcc doesn't know this, so it warns about 'actual_old_val' being unused. Shut it up with an uninitialized_var(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183705.672B115E@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Introduce new 'directory entry' to 'addr' helper functionDave Hansen
Currently, to get from a bounds directory entry to the virtual address of a bounds table, we simply mask off a few low bits. However, the set of bits we mask off is different for 32-bit and 64-bit binaries. This breaks the operation out in to a helper function and also adds a temporary variable to store the result until we are sure we are returning one. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183704.007686CE@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Add temporary variable to reduce maskingDave Hansen
When we allocate a bounds table, we call mmap(), then add a "valid" bit to the value before storing it in to the bounds directory. If we fail along the way, we go and mask that valid bit _back_ out. That seems a little silly, and this makes it much more clear when we have a plain address versus an actual table _entry_. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183704.3D69D5F4@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Trace allocation of new bounds tablesDave Hansen
Bounds tables are a significant consumer of memory. It is important to know when they are being allocated. Add a trace point to trace whenever an allocation occurs and also its virtual address. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183704.EC23A93E@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Trace the attempts to find bounds tablesDave Hansen
There are two different events being traced here. They are doing similar things so share a trace "EVENT_CLASS" and are presented together. 1. Trace when MPX is zapping pages "mpx_unmap_zap": When MPX can not free an entire bounds table, it will instead try to zap unused parts of a bounds table to free the backing memory. This decreases RSS (resident set size) without decreasing the virtual space allocated for bounds tables. 2. Trace attempts to find bounds tables "mpx_unmap_search": This event traces any time we go looking to unmap a bounds table for a given virtual address range. This is useful to ensure that the kernel actually "tried" to free a bounds table versus times it succeeded in finding one. It might try and fail if it realized that a table was shared with an adjacent VMA which is not being unmapped. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183703.B9D2468B@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Trace entry to bounds exception pathsDave Hansen
There are two basic things that can happen as the result of a bounds exception (#BR): 1. We allocate a new bounds table 2. We pass up a bounds exception to userspace. This patch adds a trace point for the case where we are passing the exception up to userspace with a signal. We are also explicit that we're printing out the inverse of the 'upper' that we encounter. If you want to filter, for instance, you need to ~ the value first. The reason we do this is because of how 'upper' is stored in the bounds table. If a pointer's range is: 0x1000 -> 0x2000 it is stored in the bounds table as (32-bits here for brevity): lower: 0x00001000 upper: 0xffffdfff That is so that an all 0's entry: lower: 0x00000000 upper: 0x00000000 corresponds to the "init" bounds which store a *range* of: 0x00000000 -> 0xffffffff That is, by far, the common case, and that lets us use the zero page, or deduplicate the memory, etc... The 'upper' stored in the table is gibberish to print by itself, so we print ~upper to get the *actual*, logical, human-readable value printed out. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183703.027BB9B0@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Trace #BR exceptionsDave Hansen
This is the first in a series of MPX tracing patches. I've found these extremely useful in the process of debugging applications and the kernel code itself. This exception hooks in to the bounds (#BR) exception very early and allows capturing the key registers which would influence how the exception is handled. Note that bndcfgu/bndstatus are technically still 64-bit registers even in 32-bit mode. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183703.5FE2619A@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Restrict the mmap() size check to bounds tablesDave Hansen
The comment and code here are confusing. We do not currently allocate the bounds directory in the kernel. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150607183702.222CEC2A@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Clean up the code by not passing a task pointer around when unnecessaryDave Hansen
The MPX code can only work on the current task. You can not, for instance, enable MPX management in another process or thread. You can also not handle a fault for another process or thread. Despite this, we pass a task_struct around prolifically. This patch removes all of the task struct passing for code paths where the code can not deal with another task (which turns out to be all of them). This has no functional changes. It's just a cleanup. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-09x86/mpx: Use the new get_xsave_field_ptr()APIDave Hansen
The MPX registers (bndcsr/bndcfgu/bndstatus) are not directly accessible via normal instructions. They essentially act as if they were floating point registers and are saved/restored along with those registers. There are two main paths in the MPX code where we care about the contents of these registers: 1. #BR (bounds) faults 2. the prctl() code where we are setting MPX up Both of those paths _might_ be called without the FPU having been used. That means that 'tsk->thread.fpu.state' might never be allocated. Also, fpu_save_init() is not preempt-safe. It was a bug to call it without disabling preemption. The new get_xsave_addr() calls unlazy_fpu() instead and properly disables preemption. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Suresh Siddha <sbsiddha@gmail.com> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183701.BC0D37CF@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Harmonize FPU register state typesIngo Molnar
Use these consistent names: struct fregs_state # was: i387_fsave_struct struct fxregs_state # was: i387_fxsave_struct struct swregs_state # was: i387_soft_struct struct xregs_state # was: xsave_struct union fpregs_state # was: thread_xstate Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Rename all the fpregs, xregs, fxregs and fregs handling functionsIngo Molnar
Standardize the naming of the various functions that copy register content in specific FPU context formats: copy_fxregs_to_kernel() # was: fpu_fxsave() copy_xregs_to_kernel() # was: xsave_state() copy_kernel_to_fregs() # was: frstor_checking() copy_kernel_to_fxregs() # was: fxrstor_checking() copy_kernel_to_xregs() # was: fpu_xrstor_checking() copy_kernel_to_xregs_booting() # was: xrstor_state_booting() copy_fregs_to_user() # was: fsave_user() copy_fxregs_to_user() # was: fxsave_user() copy_xregs_to_user() # was: xsave_user() copy_user_to_fregs() # was: frstor_user() copy_user_to_fxregs() # was: fxrstor_user() copy_user_to_xregs() # was: xrestore_user() copy_user_to_fpregs_zeroing() # was: restore_user_xstate() Eliminate fpu_xrstor_checking(), because it was just a wrapper. No change in functionality. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Simplify FPU handling by embedding the fpstate in task_struct (again)Ingo Molnar
So 6 years ago we made the FPU fpstate dynamically allocated: aa283f49276e ("x86, fpu: lazy allocation of FPU area - v5") 61c4628b5386 ("x86, fpu: split FPU state from task struct - v5") In hindsight this was a mistake: - it complicated context allocation failure handling, such as: /* kthread execs. TODO: cleanup this horror. */ if (WARN_ON(fpstate_alloc_init(fpu))) force_sig(SIGKILL, tsk); - it caused us to enable irqs in fpu__restore(): local_irq_enable(); /* * does a slab alloc which can sleep */ if (fpstate_alloc_init(fpu)) { /* * ran out of memory! */ do_group_exit(SIGKILL); return; } local_irq_disable(); - it (slightly) slowed down task creation/destruction by adding slab allocation/free pattens. - it made access to context contents (slightly) slower by adding one more pointer dereference. The motivation for the dynamic allocation was two-fold: - reduce memory consumption by non-FPU tasks - allocate and handle only the necessary amount of context for various XSAVE processors that have varying hardware frame sizes. These days, with glibc using SSE memcpy by default and GCC optimizing for SSE/AVX by default, the scope of FPU using apps on an x86 system is much larger than it was 6 years ago. For example on a freshly installed Fedora 21 desktop system, with a recent kernel, all non-kthread tasks have used the FPU shortly after bootup. Also, even modern embedded x86 CPUs try to support the latest vector instruction set - so they'll too often use the larger xstate frame sizes. So remove the dynamic allocation complication by embedding the FPU fpstate in task_struct again. This should make the FPU a lot more accessible to all sorts of atomic contexts. We could still optimize for the xstate frame size in the future, by moving the state structure to the last element of task_struct, and allocating only a part of that. This change is kept minimal by still keeping the ctx_alloc()/free() routines (that now do nothing substantial) - we'll remove them in the following patches. Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Rename fpu_save_init() to copy_fpregs_to_fpstate()Ingo Molnar
So fpu_save_init() is a historic name that got its name when the only way the FPU state was FNSAVE, which cleared (well, destroyed) the FPU state after saving it. Nowadays the name is misleading, because ever since the introduction of FXSAVE (and more modern FPU saving instructions) the 'we need to reload the FPU state' part is only true if there's a pending FPU exception [*], which is almost never the case. So rename it to copy_fpregs_to_fpstate() to make it clear what's happening. Also add a few comments about why we cannot keep registers in certain cases. Also clean up the control flow a bit, to make it more apparent when we are dropping/keeping FP registers, and to optimize the common case (of keeping fpregs) some more. [*] Probably not true anymore, modern instructions always leave the FPU state intact, even if exceptions are pending: because pending FP exceptions are posted on the next FP instruction, not asynchronously. They were truly asynchronous back in the IRQ13 case, and we had to synchronize with them, but that code is not working anymore: we don't have IRQ13 mapped in the IDT anymore. But a cleanup patch is obviously not the place to change subtle behavior. Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Rename fpu-internal.h to fpu/internal.hIngo Molnar
This unifies all the FPU related header files under a unified, hiearchical naming scheme: - asm/fpu/types.h: FPU related data types, needed for 'struct task_struct', widely included in almost all kernel code, and hence kept as small as possible. - asm/fpu/api.h: FPU related 'public' methods exported to other subsystems. - asm/fpu/internal.h: FPU subsystem internal methods - asm/fpu/xsave.h: XSAVE support internal methods (Also standardize the header guard in asm/fpu/internal.h.) Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Remove fpu_xsave()Ingo Molnar
It's a pointless wrapper now - use xsave_state(). Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19x86/fpu: Fix header file dependencies of fpu-internal.hIngo Molnar
Fix a minor header file dependency bug in asm/fpu-internal.h: it relies on i387.h but does not include it. All users of fpu-internal.h included it explicitly. Also remove unnecessary includes, to reduce compilation time. This also makes it easier to use it as a standalone header file for FPU internals, such as an upcoming C module in arch/x86/kernel/fpu/. Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-06Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "EFI fixes, and FPU fix, a ticket spinlock boundary condition fix and two build fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Always restore_xinit_state() when use_eager_cpu() x86: Make cpu_tss available to external modules efi: Fix error handling in add_sysfs_runtime_map_entry() x86/spinlocks: Fix regression in spinlock contention detection x86/mm: Clean up types in xlate_dev_mem_ptr() x86/efi: Store upper bits of command line buffer address in ext_cmd_line_ptr efivarfs: Ensure VariableName is NUL-terminated
2015-04-20x86/mm: Clean up types in xlate_dev_mem_ptr()Ingo Molnar
Pavel Machek reported the following compiler warning on x86/32 CONFIG_HIGHMEM64G=y builds: arch/x86/mm/ioremap.c:344:10: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] Clean up the types in this function by using a single natural type for internal calculations (unsigned long), to make it more apparent what's happening, and also to remove fragile casts. Reported-by: Pavel Machek <pavel@ucw.cz> Cc: jgross@suse.com Cc: roland@purestorage.com Link: http://lkml.kernel.org/r/20150416080440.GA507@amd Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-14mm: move memtest under mmVladimir Murzin
Memtest is a simple feature which fills the memory with a given set of patterns and validates memory contents, if bad memory regions is detected it reserves them via memblock API. Since memblock API is widely used by other architectures this feature can be enabled outside of x86 world. This patch set promotes memtest to live under generic mm umbrella and enables memtest feature for arm/arm64. It was reported that this patch set was useful for tracking down an issue with some errant DMA on an arm64 platform. This patch (of 6): There is nothing platform dependent in the core memtest code, so other platforms might benefit from this feature too. [linux@roeck-us.net: MEMTEST depends on MEMBLOCK] Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: expose arch_mmap_rnd when availableKees Cook
When an architecture fully supports randomizing the ELF load location, a per-arch mmap_rnd() function is used to find a randomized mmap base. In preparation for randomizing the location of ET_DYN binaries separately from mmap, this renames and exports these functions as arch_mmap_rnd(). Additionally introduces CONFIG_ARCH_HAS_ELF_RANDOMIZE for describing this feature on architectures that support it (which is a superset of ARCH_BINFMT_ELF_RANDOMIZE_PIE, since s390 already supports a separated ET_DYN ASLR from mmap ASLR without the ARCH_BINFMT_ELF_RANDOMIZE_PIE logic). Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86: standardize mmap_rnd() usageKees Cook
In preparation for splitting out ET_DYN ASLR, this refactors the use of mmap_rnd() to be used similarly to arm, and extracts the checking of PF_RANDOMIZE. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86, mm: support huge KVA mappings on x86Toshi Kani
Implement huge KVA mapping interfaces on x86. On x86, MTRRs can override PAT memory types with a 4KB granularity. When using a huge page, MTRRs can override the memory type of the huge page, which may lead a performance penalty. The processor can also behave in an undefined manner if a huge page is mapped to a memory range that MTRRs have mapped with multiple different memory types. Therefore, the mapping code falls back to use a smaller page size toward 4KB when a mapping range is covered by non-WB type of MTRRs. The WB type of MTRRs has no affect on the PAT memory types. pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see if a given range is covered by MTRRs. MTRR_TYPE_WRBACK indicates that the range is either covered by WB or not covered and the MTRR default value is set to WB. 0xFF indicates that MTRRs are disabled. HAVE_ARCH_HUGE_VMAP is selected when X86_64 or X86_32 with X86_PAE is set. X86_32 without X86_PAE is not supported since such config can unlikey be benefited from this feature, and there was an issue found in testing. [fengguang.wu@intel.com: ioremap_pud_capable can be static] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86, mm: support huge I/O mapping capability I/FToshi Kani
Implement huge I/O mapping capability interfaces for ioremap() on x86. IOREMAP_MAX_ORDER is defined to PUD_SHIFT on x86/64 and PMD_SHIFT on x86/32, which overrides the default value defined in <linux/vmalloc.h>. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86: expose number of page table levels on Kconfig levelKirill A. Shutemov
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-13Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: "Leftover from 4.0 Fix a local stack variable corruption with certain kdump usage patterns (Dave Young)" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/numa: Fix kernel stack corruption in numa_init()->numa_clear_kernel_node_hotplug()
2015-04-13Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm changes from Ingo Molnar: "The main changes in this cycle were: - reduce the x86/32 PAE per task PGD allocation overhead from 4K to 0.032k (Fenghua Yu) - early_ioremap/memunmap() usage cleanups (Juergen Gross) - gbpages support cleanups (Luis R Rodriguez) - improve AMD Bulldozer (family 0x15) ASLR I$ aliasing workaround to increase randomization by 3 bits (per bootup) (Hector Marco-Gisbert) - misc fixlets" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Improve AMD Bulldozer ASLR workaround x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a bit more readable fashion init.h: Clean up the __setup()/early_param() macros x86/mm: Simplify probe_page_size_mask() x86/mm: Further simplify 1 GB kernel linear mappings handling x86/mm: Use early_param_on_off() for direct_gbpages init.h: Add early_param_on_off() x86/mm: Simplify enabling direct_gbpages x86/mm: Use IS_ENABLED() for direct_gbpages x86/mm: Unexport set_memory_ro() and set_memory_rw() x86/mm, efi: Use early_ioremap() in arch/x86/platform/efi/efi-bgrt.c x86/mm: Use early_memunmap() instead of early_iounmap() x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT cases x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytes
2015-04-07x86/mm/numa: Fix kernel stack corruption in ↵Dave Young
numa_init()->numa_clear_kernel_node_hotplug() I got below kernel panic during kdump test on Thinkpad T420 laptop: [ 0.000000] No NUMA configuration found [ 0.000000] Faking a node at [mem 0x0000000000000000-0x0000000037ba4fff] [ 0.000000] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81d21910 ... [ 0.000000] Call Trace: [ 0.000000] [<ffffffff817c2a26>] dump_stack+0x45/0x57 [ 0.000000] [<ffffffff817bc8d2>] panic+0xd0/0x204 [ 0.000000] [<ffffffff81d21910>] ? numa_clear_kernel_node_hotplug+0xe6/0xf2 [ 0.000000] [<ffffffff8107741b>] __stack_chk_fail+0x1b/0x20 [ 0.000000] [<ffffffff81d21910>] numa_clear_kernel_node_hotplug+0xe6/0xf2 [ 0.000000] [<ffffffff81d21e5d>] numa_init+0x1a5/0x520 [ 0.000000] [<ffffffff81d222b1>] x86_numa_init+0x19/0x3d [ 0.000000] [<ffffffff81d22460>] initmem_init+0x9/0xb [ 0.000000] [<ffffffff81d0d00c>] setup_arch+0x94f/0xc82 [ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120 [ 0.000000] [<ffffffff817bd0bb>] ? printk+0x55/0x6b [ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120 [ 0.000000] [<ffffffff81d05d9b>] start_kernel+0xe8/0x4d6 [ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120 [ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120 [ 0.000000] [<ffffffff81d055ee>] x86_64_start_reservations+0x2a/0x2c [ 0.000000] [<ffffffff81d05751>] x86_64_start_kernel+0x161/0x184 [ 0.000000] ---[ end Kernel panic - not syncing: stack-protector: Kernel sta This is caused by writing over the end of numa mask bitmap in numa_clear_kernel_node(). numa_clear_kernel_node() tries to set the node id in a mask bitmap, by iterating all reserved regions and assuming that every region has a valid nid. This assumption is not true because there's an exception for some graphic memory quirks. See trim_snb_memory() in arch/x86/kernel/setup.c It is easily to reproduce the bug in the kdump kernel because kdump kernel use pre-reserved memory instead of the whole memory, but kexec pass other reserved memory ranges to 2nd kernel as well. like below in my test: kdump kernel ram 0x2d000000 - 0x37bfffff One of the reserved regions: 0x40000000 - 0x40100000 which includes 0x40004000, a page excluded in trim_snb_memory(). For this memblock reserved region the nid is not set, it is still default value MAX_NUMNODES. later node_set will set bit MAX_NUMNODES thus stack corruption happen. This also happens when booting with mem= kernel commandline during my test. Fixing it by adding a check, do not call node_set in case nid is MAX_NUMNODES. Signed-off-by: Dave Young <dyoung@redhat.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bhe@redhat.com Cc: qiuxishi@huawei.com Link: http://lkml.kernel.org/r/20150407134132.GA23522@dhcp-16-198.nay.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23x86/asm/entry: Change all 'user_mode_vm()' calls to 'user_mode()'Andy Lutomirski
user_mode_vm() and user_mode() are now the same. Change all callers of user_mode_vm() to user_mode(). The next patch will remove the definition of user_mode_vm. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brad Spengler <spender@grsecurity.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/43b1f57f3df70df5a08b0925897c660725015554.1426728647.git.luto@kernel.org [ Merged to a more recent kernel. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23x86/mm/fault: Use TASK_SIZE_MAX in is_prefetch()Andy Lutomirski
This is slightly shorter and slightly faster. It's also more correct: the split between user and kernel addresses is TASK_SIZE_MAX, regardless of ti->flags. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brad Spengler <spender@grsecurity.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/09156b63bad90a327827003c9e53faa82ef4c56e.1426728647.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a ↵Ingo Molnar
bit more readable fashion The initialization of these two arrays is a bit difficult to follow: restructure it optically so that a 2D structure shows which bit in the PTE is set and which not. Also improve on comments a bit. No code or data changed: # arch/x86/mm/init.o: text data bss dec hex filename 4585 424 29776 34785 87e1 init.o.before 4585 424 29776 34785 87e1 init.o.after md5: a82e11ff58bcfd0af3a94662a701f65d init.o.before.asm a82e11ff58bcfd0af3a94662a701f65d init.o.after.asm Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150305082135.GB5969@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/mm: Simplify probe_page_size_mask()Ingo Molnar
Now that we've simplified the gbpages config space, move the 'page_size_mask' initialization into probe_page_size_mask(), right next to the PSE and PGE enablement lines. Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/mm: Further simplify 1 GB kernel linear mappings handlingIngo Molnar
It's a bit pointless to allow Kconfig configuration for 1GB kernel mappings, it's already hidden behind a 'default y' and CONFIG_EXPERT. Remove this complication and simplify the code by renaming CONFIG_ENABLE_DIRECT_GBPAGES to CONFIG_X86_DIRECT_GBPAGES and document the DEBUG_PAGE_ALLOC and KMEMCHECK quirks. Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/mm: Use early_param_on_off() for direct_gbpagesLuis R. Rodriguez
The enabler / disabler is pretty simple, just use the provided wrappers, this lets us easily relate the variable to the associated Kconfig entry. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1425518654-3403-5-git-send-email-mcgrof@do-not-panic.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/mm: Simplify enabling direct_gbpagesLuis R. Rodriguez
direct_gbpages can be force enabled as an early parameter but not really have taken effect when DEBUG_PAGEALLOC or KMEMCHECK is enabled. You can also enable direct_gbpages right now if you have an x86_64 architecture but your CPU doesn't really have support for this feature. In both cases PG_LEVEL_1G won't actually be enabled but direct_gbpages is used in other areas under the assumptions that PG_LEVEL_1G was set. Fix this by putting together all requirements which make this feature sensible to enable under, and only enable both finally flipping on PG_LEVEL_1G and leaving PG_LEVEL_1G set when this is true. We only enable this feature then to be possible on sensible builds defined by the new ENABLE_DIRECT_GBPAGES. If the CPU has support for it you can either enable this by using the DIRECT_GBPAGES option or using the early kernel parameter. If a platform had support for this you can always force disable it as well. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1425518654-3403-3-git-send-email-mcgrof@do-not-panic.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05x86/mm: Use IS_ENABLED() for direct_gbpagesLuis R. Rodriguez
Replace #ifdef eyesore with IS_ENABLED() use. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1425518654-3403-2-git-send-email-mcgrof@do-not-panic.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-04Merge tag 'v4.0-rc2' into x86/asm, to refresh the treeIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-28x86/mm: Unexport set_memory_ro() and set_memory_rw()Daniel Borkmann
This effectively unexports set_memory_ro() and set_memory_rw() functions, and thus reverts: a03352d2c1dc ("x86: export set_memory_ro and set_memory_rw"). They have been introduced for debugging purposes in e1000e, but no module user is in mainline kernel (anymore?) and we explicitly do not want modules to use these functions, as they i.e. protect eBPF (interpreted & JIT'ed) images from malicious modifications or bugs. Outside of eBPF scope, I believe also other set_memory_*() functions should be unexported on x86 for modules. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Bruce Allan <bruce.w.allan@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Link: http://lkml.kernel.org/r/a064393a0a5d319eebde5c761cfd743132d4f213.1425040940.git.daniel@iogearbox.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-24Merge tag 'v4.0-rc1' into x86/mm, to refresh the treeIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-21Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: "This contains: - EFI fixes - a boot printout fix - ASLR/kASLR fixes - intel microcode driver fixes - other misc fixes Most of the linecount comes from an EFI revert" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch x86/microcode/intel: Handle truncated microcode images more robustly x86/microcode/intel: Guard against stack overflow in the loader x86, mm/ASLR: Fix stack randomization on 64-bit systems x86/mm/init: Fix incorrect page size in init_memory_mapping() printks x86/mm/ASLR: Propagate base load address calculation Documentation/x86: Fix path in zero-page.txt x86/apic: Fix the devicetree build in certain configs Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes" x86/efi: Avoid triple faults during EFI mixed mode calls
2015-02-19Merge branch 'tip-x86-kaslr' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent Pull ASLR and kASLR fixes from Borislav Petkov: - Add a global flag announcing KASLR state so that relevant code can do informed decisions based on its setting. (Jiri Kosina) - Fix a stack randomization entropy decrease bug. (Hector Marco-Gisbert) Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19x86, mm/ASLR: Fix stack randomization on 64-bit systemsHector Marco-Gisbert
The issue is that the stack for processes is not properly randomized on 64 bit architectures due to an integer overflow. The affected function is randomize_stack_top() in file "fs/binfmt_elf.c": static unsigned long randomize_stack_top(unsigned long stack_top) { unsigned int random_variable = 0; if ((current->flags & PF_RANDOMIZE) && !(current->personality & ADDR_NO_RANDOMIZE)) { random_variable = get_random_int() & STACK_RND_MASK; random_variable <<= PAGE_SHIFT; } return PAGE_ALIGN(stack_top) + random_variable; return PAGE_ALIGN(stack_top) - random_variable; } Note that, it declares the "random_variable" variable as "unsigned int". Since the result of the shifting operation between STACK_RND_MASK (which is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64): random_variable <<= PAGE_SHIFT; then the two leftmost bits are dropped when storing the result in the "random_variable". This variable shall be at least 34 bits long to hold the (22+12) result. These two dropped bits have an impact on the entropy of process stack. Concretely, the total stack entropy is reduced by four: from 2^28 to 2^30 (One fourth of expected entropy). This patch restores back the entropy by correcting the types involved in the operations in the functions randomize_stack_top() and stack_maxrandom_size(). The successful fix can be tested with: $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done 7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0 [stack] 7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0 [stack] 7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0 [stack] 7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0 [stack] ... Once corrected, the leading bytes should be between 7ffc and 7fff, rather than always being 7fff. Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es> Signed-off-by: Ismael Ripoll <iripoll@upv.es> [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Fixes: CVE-2015-1593 Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19x86/mm/init: Fix incorrect page size in init_memory_mapping() printksDave Hansen
With 32-bit non-PAE kernels, we have 2 page sizes available (at most): 4k and 4M. Enabling PAE replaces that 4M size with a 2M one (which 64-bit systems use too). But, when booting a 32-bit non-PAE kernel, in one of our early-boot printouts, we say: init_memory_mapping: [mem 0x00000000-0x000fffff] [mem 0x00000000-0x000fffff] page 4k init_memory_mapping: [mem 0x37000000-0x373fffff] [mem 0x37000000-0x373fffff] page 2M init_memory_mapping: [mem 0x00100000-0x36ffffff] [mem 0x00100000-0x003fffff] page 4k [mem 0x00400000-0x36ffffff] page 2M init_memory_mapping: [mem 0x37400000-0x377fdfff] [mem 0x37400000-0x377fdfff] page 4k Which is obviously wrong. There is no 2M page available. This is probably because of a badly-named variable: in the map_range code: PG_LEVEL_2M. Instead of renaming all the PG_LEVEL_2M's. This patch just fixes the printout: init_memory_mapping: [mem 0x00000000-0x000fffff] [mem 0x00000000-0x000fffff] page 4k init_memory_mapping: [mem 0x37000000-0x373fffff] [mem 0x37000000-0x373fffff] page 4M init_memory_mapping: [mem 0x00100000-0x36ffffff] [mem 0x00100000-0x003fffff] page 4k [mem 0x00400000-0x36ffffff] page 4M init_memory_mapping: [mem 0x37400000-0x377fdfff] [mem 0x37400000-0x377fdfff] page 4k BRK [0x03206000, 0x03206fff] PGTABLE Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/20150210212030.665EC267@viggo.jf.intel.com Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19x86-64: Also clear _PAGE_GLOBAL from __supported_pte_mask if !cpu_has_pgeJan Beulich
Not just setting it when the feature is available is for consistency, and may allow Xen to drop its custom clearing of the flag (unless it needs it cleared earlier than this code executes). Note that the change is benign to ix86, as the flag starts out clear there. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/54C215D10200007800058912@mail.emea.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT casesPavel Machek
STRICT_DEVMEM and PAT produce same failure accessing /dev/mem, which is quite confusing to the user. Make printk messages different to lessen confusion. Signed-off-by: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytesFenghua Yu
With more embedded systems emerging using Quark, among other things, 32-bit kernel matters again. 32-bit machine and kernel uses PAE paging, which currently wastes at least 4K of memory per process on Linux where we have to reserve an entire page to support a single 32-byte PGD structure. It would be a very good thing if we could eliminate that wastage. PAE paging is used to access more than 4GB memory on x86-32. And it is required for NX. In this patch, we still allocate one page for pgd for a Xen domain and 64-bit kernel because one page pgd is assumed in these cases. But we can save memory space by only allocating 32-byte pgd for 32-bit PAE kernel when it is not running as a Xen domain. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Glenn Williamson <glenn.p.williamson@intel.com> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1421382601-46912-1-git-send-email-fenghua.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>