summaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)Author
2018-12-17x86/mm/cpa: Fix cpa_flush_array() TLB invalidationPeter Zijlstra
In commit: a7295fd53c39 ("x86/mm/cpa: Use flush_tlb_kernel_range()") I misread the CAP array code and incorrectly used tlb_flush_kernel_range(), resulting in missing TLB flushes and consequent failures. Instead do a full invalidate in this case -- for now. Reported-by: StDenis, Tom <Tom.StDenis@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave.hansen@intel.com Fixes: a7295fd53c39 ("x86/mm/cpa: Use flush_tlb_kernel_range()") Link: http://lkml.kernel.org/r/20181203171043.089868285@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-13dma-direct: merge swiotlb_dma_ops into the dma_direct codeChristoph Hellwig
While the dma-direct code is (relatively) clean and simple we actually have to use the swiotlb ops for the mapping on many architectures due to devices with addressing limits. Instead of keeping two implementations around this commit allows the dma-direct implementation to call the swiotlb bounce buffering functions and thus share the guts of the mapping implementation. This also simplified the dma-mapping setup on a few architectures where we don't have to differenciate which implementation to use. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com>
2018-12-11x86/mm: Fix decoy address handling vs 32-bit buildsDan Williams
A decoy address is used by set_mce_nospec() to update the cache attributes for a page that may contain poison (multi-bit ECC error) while attempting to minimize the possibility of triggering a speculative access to that page. When reserve_memtype() is handling a decoy address it needs to convert it to its real physical alias. The conversion, AND'ing with __PHYSICAL_MASK, is broken for a 32-bit physical mask and reserve_memtype() is passed the last physical page. Gert reports triggering the: BUG_ON(start >= end); ...assertion when running a 32-bit non-PAE build on a platform that has a driver resource at the top of physical memory: BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved Given that the decoy address scheme is only targeted at 64-bit builds and assumes that the top of physical address space is free for use as a decoy address range, simply bypass address sanitization in the 32-bit case. Lastly, there was no need to crash the system when this failure occurred, and no need to crash future systems if the assumptions of decoy addresses are ever violated. Change the BUG_ON() to a WARN() with an error return. Fixes: 510ee090abc3 ("x86/mm/pat: Prepare {reserve, free}_memtype() for...") Reported-by: Gert Robben <t2@gert.gr> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Gert Robben <t2@gert.gr> Cc: stable@vger.kernel.org Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: platform-driver-x86@vger.kernel.org Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/154454337985.789277.12133288391664677775.stgit@dwillia2-desk3.amr.corp.intel.com
2018-12-11x86/speculation/l1tf: Drop the swap storage limit restriction when l1tf=offMichal Hocko
Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever the system is deemed affected by L1TF vulnerability. Even though the limit is quite high for most deployments it seems to be too restrictive for deployments which are willing to live with the mitigation disabled. We have a customer to deploy 8x 6,4TB PCIe/NVMe SSD swap devices which is clearly out of the limit. Drop the swap restriction when l1tf=off is specified. It also doesn't make much sense to warn about too much memory for the l1tf mitigation when it is forcefully disabled by the administrator. [ tglx: Folded the documentation delta change ] Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: <linux-mm@kvack.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
2018-12-11x86/dump_pagetables: Fix LDT remap address markerKirill A. Shutemov
The LDT remap placement has been changed. It's now placed before the direct mapping in the kernel virtual address space for both paging modes. Change address markers order accordingly. Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: hpa@zytor.com Cc: dave.hansen@linux.intel.com Cc: luto@kernel.org Cc: peterz@infradead.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: bhe@redhat.com Cc: hans.van.kranenburg@mendix.com Cc: linux-mm@kvack.org Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20181130202328.65359-3-kirill.shutemov@linux.intel.com
2018-12-11x86/mm: Fix guard hole handlingKirill A. Shutemov
There is a guard hole at the beginning of the kernel address space, also used by hypervisors. It occupies 16 PGD entries. This reserved range is not defined explicitely, it is calculated relative to other entities: direct mapping and user space ranges. The calculation got broken by recent changes of the kernel memory layout: LDT remap range is now mapped before direct mapping and makes the calculation invalid. The breakage leads to crash on Xen dom0 boot[1]. Define the reserved range explicitely. It's part of kernel ABI (hypervisors expect it to be stable) and must not depend on changes in the rest of kernel memory layout. [1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03313.html Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging") Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: bp@alien8.de Cc: hpa@zytor.com Cc: dave.hansen@linux.intel.com Cc: luto@kernel.org Cc: peterz@infradead.org Cc: boris.ostrovsky@oracle.com Cc: bhe@redhat.com Cc: linux-mm@kvack.org Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20181130202328.65359-2-kirill.shutemov@linux.intel.com
2018-12-05x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()Dan Williams
Commit: f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()" addressed a case where __flush_tlb_all() is called without preemption being disabled. It also left a warning to catch other cases where preemption is not disabled. That warning triggers for the memory hotplug path which is also used for persistent memory enabling: WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460 RIP: 0010:__flush_tlb_all+0x1b/0x3a [..] Call Trace: phys_pud_init+0x29c/0x2bb kernel_physical_mapping_init+0xfc/0x219 init_memory_mapping+0x1a5/0x3b0 arch_add_memory+0x2c/0x50 devm_memremap_pages+0x3aa/0x610 pmem_attach_disk+0x585/0x700 [nd_pmem] Andy wondered why a path that can sleep was using __flush_tlb_all() [1] and Dave confirmed the expectation for TLB flush is for modifying / invalidating existing PTE entries, but not initial population [2]. Drop the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the expectation that this path is only ever populating empty entries for the linear map. Note, at linear map teardown time there is a call to the all-cpu flush_tlb_all() to invalidate the removed mappings. [1]: https://lkml.kernel.org/r/9DFD717D-857D-493D-A606-B635D72BAC21@amacapital.net [2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa0b5@intel.com [ mingo: Minor readability edits. ] Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave.hansen@intel.com Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()") Link: http://lkml.kernel.org/r/154395944713.32119.15611079023837132638.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-05x86/mm: Validate kernel_physical_mapping_init() PTE populationDan Williams
The usage of __flush_tlb_all() in the kernel_physical_mapping_init() path is not necessary. In general flushing the TLB is not required when updating an entry from the !present state. However, to give confidence in the future removal of TLB flushing in this path, use the new set_pte_safe() family of helpers to assert that the !present assumption is true in this path. [ mingo: Minor readability edits. ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154395944177.32119.8524957429632012270.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-03x86/pkeys: Make init_pkru_value staticSebastian Andrzej Siewior
The variable init_pkru_value isn't used outside of this file. Make it static. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm ML <kvm@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20181128222035.2996-5-bigeasy@linutronix.de
2018-12-03x86: Fix various typos in commentsIngo Molnar
Go over arch/x86/ and fix common typos in comments, and a typo in an actual function argument name. No change in functionality intended. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-30x86/mm/pageattr: Introduce helper function to unmap EFI boot servicesSai Praneeth Prakhya
Ideally, after kernel assumes control of the platform, firmware shouldn't access EFI boot services code/data regions. But, it's noticed that this is not so true in many x86 platforms. Hence, during boot, kernel reserves EFI boot services code/data regions [1] and maps [2] them to efi_pgd so that call to set_virtual_address_map() doesn't fail. After returning from set_virtual_address_map(), kernel frees the reserved regions [3] but they still remain mapped. Hence, introduce kernel_unmap_pages_in_pgd() which will later be used to unmap EFI boot services code/data regions. While at it modify kernel_map_pages_in_pgd() by: 1. Adding __init modifier because it's always used *only* during boot. 2. Add a warning if it's used after SMP is initialized because it uses __flush_tlb_all() which flushes mappings only on current CPU. Unmapping EFI boot services code/data regions will result in clearing PAGE_PRESENT bit and it shouldn't bother L1TF cases because it's already handled by protnone_mask() at arch/x86/include/asm/pgtable-invert.h. [1] efi_reserve_boot_services() [2] efi_map_region() -> __map_region() -> kernel_map_pages_in_pgd() [3] efi_free_boot_services() Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arend van Spriel <arend.vanspriel@broadcom.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Snowberg <eric.snowberg@oracle.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: Julien Thierry <julien.thierry@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: YiFei Zhu <zhuyifei1999@gmail.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20181129171230.18699-5-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-28x86/speculation: Prepare for conditional IBPB in switch_mm()Thomas Gleixner
The IBPB speculation barrier is issued from switch_mm() when the kernel switches to a user space task with a different mm than the user space task which ran last on the same CPU. An additional optimization is to avoid IBPB when the incoming task can be ptraced by the outgoing task. This optimization only works when switching directly between two user space tasks. When switching from a kernel task to a user space task the optimization fails because the previous task cannot be accessed anymore. So for quite some scenarios the optimization is just adding overhead. The upcoming conditional IBPB support will issue IBPB only for user space tasks which have the TIF_SPEC_IB bit set. This requires to handle the following cases: 1) Switch from a user space task (potential attacker) which has TIF_SPEC_IB set to a user space task (potential victim) which has TIF_SPEC_IB not set. 2) Switch from a user space task (potential attacker) which has TIF_SPEC_IB not set to a user space task (potential victim) which has TIF_SPEC_IB set. This needs to be optimized for the case where the IBPB can be avoided when only kernel threads ran in between user space tasks which belong to the same process. The current check whether two tasks belong to the same context is using the tasks context id. While correct, it's simpler to use the mm pointer because it allows to mangle the TIF_SPEC_IB bit into it. The context id based mechanism requires extra storage, which creates worse code. When a task is scheduled out its TIF_SPEC_IB bit is mangled as bit 0 into the per CPU storage which is used to track the last user space mm which was running on a CPU. This bit can be used together with the TIF_SPEC_IB bit of the incoming task to make the decision whether IBPB needs to be issued or not to cover the two cases above. As conditional IBPB is going to be the default, remove the dubious ptrace check for the IBPB always case and simply issue IBPB always when the process changes. Move the storage to a different place in the struct as the original one created a hole. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.466447057@linutronix.de
2018-11-22x86/fault: Clean up the page fault oops decoder a bitIngo Molnar
- Make the oops messages a bit less scary (don't mention 'HW errors') - Turn 'PROT USER' (which is visually easily confused with PROT_USER) into individual bit descriptors: "[PROT] [USER]". This also makes "[normal kernel read fault]" more apparent. - De-abbreviate variables to make the code easier to read - Use vertical alignment where appropriate. - Add comment about string size limits and the helper function. - Remove unnecessary line breaks. Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-22x86/fault: Decode page fault OOPSes betterAndy Lutomirski
One of Linus' favorite hobbies seems to be looking at OOPSes and decoding the error code in his head. This is not one of my favorite hobbies :) Teach the page fault OOPS hander to decode the error code. If it's a !USER fault from user mode, print an explicit note to that effect and print out the addresses of various tables that might cause such an error. With this patch applied, if I intentionally point the LDT at 0x0 and run the x86 selftests, I get: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 HW error: normal kernel read fault This was a system access from user code IDT: 0xfffffe0000000000 (limit=0xfff) GDT: 0xfffffe0000001000 (limit=0x7f) LDTR: 0x50 -- base=0x0 limit=0xfff7 TR: 0x40 -- base=0xfffffe0000003000 limit=0x206f PGD 800000000456e067 P4D 800000000456e067 PUD 4623067 PMD 0 SMP PTI CPU: 0 PID: 153 Comm: ldt_gdt_64 Not tainted 4.19.0+ #1317 Hardware name: ... RIP: 0033:0x401454 Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/11212acb25980cd1b3030875cd9502414fbb214d.1542841400.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-22x86/fault: Don't try to recover from an implicit supervisor accessAndy Lutomirski
This avoids a situation in which we attempt to apply various fixups that are not intended to handle implicit supervisor accesses from user mode if we screw up in a way that causes this type of fault. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/9999f151d72ff352265f3274c5ab3a4105090f49.1542841400.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-22x86/fault: Remove sw_error_codeAndy Lutomirski
All of the fault handling code now corrently checks user_mode(regs) as needed, and nothing depends on the X86_PF_USER bit being munged. Get rid of the sw_error code and use hw_error_code everywhere. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/078f5b8ae6e8c79ff8ee7345b5c476c45003e5ac.1542841400.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/fault: Don't set thread.cr2, etc before OOPSingAndy Lutomirski
The fault handling code sets the cr2, trap_nr, and error_code fields in thread_struct before OOPSing. No one reads those fields during an OOPS, so remove the code to set them. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/d418022aa0fad9cb40467aa7acaf4e95be50ee96.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/fault: Make error_code sanitization more robustAndy Lutomirski
The error code in a page fault on a kernel address indicates whether that address is mapped, which should not be revealed in a signal. The normal code path for a page fault on a kernel address sanitizes the bit, but the paths for vsyscall emulation and SIGBUS do not. Both are harmless, but for subtle reasons. SIGBUS is never sent for a kernel address, and vsyscall emulation will never fault on a kernel address per se because it will fail an access_ok() check instead. Make the code more robust by adding a helper that sets the relevant fields and sanitizing the error code in the helper. This also cleans up the code -- we had three copies of roughly the same thing. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/b31159bd55bd0c4fa061a20dfd6c429c094bebaa.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/fault: Improve the condition for signalling vs OOPSingAndy Lutomirski
__bad_area_nosemaphore() currently checks the X86_PF_USER bit in the error code to decide whether to send a signal or to treat the fault as a kernel error. This can cause somewhat erratic behavior. The straightforward cases where the CPL agrees with the hardware USER bit are all correct, but the other cases are confusing. - A user instruction accessing a kernel address with supervisor privilege (e.g. a descriptor table access failed). The USER bit will be clear, and we OOPS. This is correct, because it indicates a kernel bug, not a user error. - A user instruction accessing a user address with supervisor privilege (e.g. a descriptor table was incorrectly pointing at user memory). __bad_area_nosemaphore() will be passed a modified error code with the user bit set, and we will send a signal. Sending the signal will work (because the regs and the entry frame genuinely come from user mode), but we really ought to OOPS, as this event indicates a severe kernel bug. - A kernel instruction with user privilege (i.e. WRUSS). This should OOPS or get fixed up. The current code would instead try send a signal and malfunction. Change the logic: a signal should be sent if the faulting context is user mode *and* the access has user privilege. Otherwise it's either a kernel mode fault or a failed implicit access, either of which should end up in no_context(). Note to -stable maintainers: don't backport this unless you backport CET. The bug it fixes is unobservable in current kernels unless something is extremely wrong. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/10e509c43893170e262e82027ea399130ae81159.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/fault: Fix SMAP #PF handling buglet for implicit supervisor accessesAndy Lutomirski
Currently, if a user program somehow triggers an implicit supervisor access to a user address (e.g. if the kernel somehow sets LDTR to a user address), it will be incorrectly detected as a SMAP violation if AC is clear and SMAP is enabled. This is incorrect -- the error has nothing to do with SMAP. Fix the condition so that only accesses with the hardware USER bit set are diagnosed as SMAP violations. With the logic fixed, an implicit supervisor access to a user address will hit the code lower in the function that is intended to handle it even if SMAP is enabled. That logic is still a bit buggy, and later patches will clean it up. I *think* this code is still correct for WRUSS, and I've added a comment to that effect. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/d1d1b2e66ef31f884dba172084486ea9423ddcdb.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/fault: Fold smap_violation() into do_user_addr_fault()Andy Lutomirski
smap_violation() has a single caller, and the contents are a bit nonsensical. I'm going to fix it, but first let's fold it into its caller for ease of comprehension. In this particular case, the user_mode(regs) check is incorrect -- it will cause false positives in the case of a user-initiated kernel-privileged access. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/806c366f6ca861152398ce2c01744d59d9aceb6d.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/cpufeatures, x86/fault: Mark SMAP as disabled when configured outAndy Lutomirski
Add X86_FEATURE_SMAP to the disabled features mask as appropriate and use cpu_feature_enabled() in the fault code. This lets us get rid of a redundant IS_ENABLED(CONFIG_X86_SMAP). Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/fe93332eded3d702f0b0b4cf83928d6830739ba3.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-20x86/fault: Check user_mode(regs) when avoiding an mmap_sem deadlockAndy Lutomirski
The fault-handling code that takes mmap_sem needs to avoid a deadlock that could occur if the kernel took a bad (OOPS-worthy) page fault on a user address while holding mmap_sem. This can only happen if the faulting instruction was in the kernel (i.e. user_mode(regs)). Rather than checking the sw_error_code (which will have the USER bit set if the fault was a USER-permission access *or* if user_mode(regs)), just check user_mode(regs) directly. The old code would have malfunctioned if the kernel executed a bogus WRUSS instruction while holding mmap_sem. Fortunately, that is extremely unlikely in current kernels, which don't use WRUSS. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/4b89b542e8ceba9bd6abde2f386afed6d99244a9.1542667307.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12x86/mm/fault: Allow stack access below %rspWaiman Long
The current x86 page fault handler allows stack access below the stack pointer if it is no more than 64k+256 bytes. Any access beyond the 64k+ limit will cause a segmentation fault. The gcc -fstack-check option generates code to probe the stack for large stack allocation to see if the stack is accessible. The newer gcc does that while updating the %rsp simultaneously. Older gcc's like gcc4 doesn't do that. As a result, an application compiled with an old gcc and the -fstack-check option may fail to start at all: $ cat test.c int main() { char tmp[1024*128]; printf("### ok\n"); return 0; } $ gcc -fstack-check -g -o test test.c $ ./test Segmentation fault The old binary was working in older kernels where expand_stack() was somehow called before the check. But it is not working in newer kernels. Besides, the 64k+ limit check is kind of crude and will not catch a lot of mistakes that userspace applications may be misbehaving anyway. I think the kernel isn't the right place for this kind of tests. We should leave it to userspace instrumentation tools to perform them. The 64k+ limit check is now removed to just let expand_stack() decide if a segmentation fault should happen, when the RLIMIT_STACK limit is exceeded, for example. Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1541535149-31963-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-03Merge branch 'core/urgent' into x86/urgent, to pick up objtool fixIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-01x86/compat: Adjust in_compat_syscall() to generic code under !COMPATDmitry Safonov
The result of in_compat_syscall() can be pictured as: x86 platform: --------------------------------------------------- | Arch\syscall | 64-bit | ia32 | x32 | |-------------------------------------------------| | x86_64 | false | true | true | |-------------------------------------------------| | i686 | | <true> | | --------------------------------------------------- Other platforms: ------------------------------------------- | Arch\syscall | 64-bit | compat | |-----------------------------------------| | 64-bit | false | true | |-----------------------------------------| | 32-bit(?) | | <false> | ------------------------------------------- As seen, the result of in_compat_syscall() on generic 32-bit platform differs from i686. There is no reason for in_compat_syscall() == true on native i686. It also easy to misread code if the result on native 32-bit platform differs between arches. Because of that non arch-specific code has many places with: if (IS_ENABLED(CONFIG_COMPAT) && in_compat_syscall()) in different variations. It looks-like the only non-x86 code which uses in_compat_syscall() not under CONFIG_COMPAT guard is in amd/amdkfd. But according to the commit a18069c132cb ("amdkfd: Disable support for 32-bit user processes"), it actually should be disabled on native i686. Rename in_compat_syscall() to in_32bit_syscall() for x86-specific code and make in_compat_syscall() false under !CONFIG_COMPAT. A follow on patch will clean up generic users which were forced to check IS_ENABLED(CONFIG_COMPAT) with in_compat_syscall(). Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: John Stultz <john.stultz@linaro.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-efi@vger.kernel.org Cc: netdev@vger.kernel.org Link: https://lkml.kernel.org/r/20181012134253.23266-2-dima@arista.com
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variantsMike Rapoport
Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of identical MEMBLOCK definitions. Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: rename free_all_bootmem to memblock_free_allMike Rapoport
The conversion is done using sed -i 's@free_all_bootmem@memblock_free_all@' \ $(git grep -l free_all_bootmem) Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: replace alloc_bootmem_pages with memblock_allocMike Rapoport
The alloc_bootmem_pages() function allocates PAGE_SIZE aligned memory. memblock_alloc() with alignment set to PAGE_SIZE does exactly the same thing. The conversion is done using the following semantic patch: @@ expression e; @@ - alloc_bootmem_pages(e) + memblock_alloc(e, PAGE_SIZE) Link: http://lkml.kernel.org/r/1536927045-23536-20-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: remove _virt from APIs returning virtual addressMike Rapoport
The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*Mike Rapoport
Make it explicit that the caller gets a physical address rather than a virtual one. This will also allow using meblock_alloc prefix for memblock allocations returning virtual address, which is done in the following patches. The conversion is done using the following semantic patch: @@ expression e1, e2, e3; @@ ( - memblock_alloc(e1, e2) + memblock_phys_alloc(e1, e2) | - memblock_alloc_nid(e1, e2, e3) + memblock_phys_alloc_nid(e1, e2, e3) | - memblock_alloc_try_nid(e1, e2, e3) + memblock_phys_alloc_try_nid(e1, e2, e3) ) Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-30x86/numa_emulation: Fix uniform-split numa emulationDave Jiang
The numa_emulation() routine in the 'uniform' case walks through all the physical 'memblk' instances and divides them into N emulated nodes with split_nodes_size_interleave_uniform(). As each physical node is consumed it is removed from the physical memblk array in the numa_remove_memblk_from() helper. Since split_nodes_size_interleave_uniform() handles advancing the array as the 'memblk' is consumed it is expected that the base of the array is always specified as the argument. Otherwise, on multi-socket (> 2) configurations the uniform-split capability can generate an invalid numa configuration leading to boot failures with signatures like the following: rcu: INFO: rcu_sched detected stalls on CPUs/tasks: Sending NMI from CPU 0 to CPUs 2: NMI backtrace for cpu 2 CPU: 2 PID: 1332 Comm: pgdatinit0 Not tainted 4.19.0-rc8-next-20181019-baseline #59 RIP: 0010:__init_single_page.isra.74+0x81/0x90 [..] Call Trace: deferred_init_pages+0xaa/0xe3 deferred_init_memmap+0x18f/0x318 kthread+0xf8/0x130 ? deferred_free_pages.isra.105+0xc9/0xc9 ? kthread_stop+0x110/0x110 ret_from_fork+0x35/0x40 Fixes: 1f6a2c6d9f121 ("x86/numa_emulation: Introduce uniform split capability") Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/154049911459.2685845.9210186007479774286.stgit@dwillia2-desk3.amr.corp.intel.com
2018-10-29x86/mm/pat: Disable preemption around __flush_tlb_all()Sebastian Andrzej Siewior
The WARN_ON_ONCE(__read_cr3() != build_cr3()) in switch_mm_irqs_off() triggers every once in a while during a snapshotted system upgrade. The warning triggers since commit decab0888e6e ("x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()"). The callchain is: get_page_from_freelist() -> post_alloc_hook() -> __kernel_map_pages() with CONFIG_DEBUG_PAGEALLOC enabled. Disable preemption during CR3 reset / __flush_tlb_all() and add a comment why preemption has to be disabled so it won't be removed accidentaly. Add another preemptible() check in __flush_tlb_all() to catch callers with enabled preemption when PGE is enabled, because PGE enabled does not trigger the warning in __native_flush_tlb(). Suggested by Andy Lutomirski. Fixes: decab0888e6e ("x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181017103432.zgv46nlu3hc7k4rq@linutronix.de
2018-10-24Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo updates from Eric Biederman: "I have been slowly sorting out siginfo and this is the culmination of that work. The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields. At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes. This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not. I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo. Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case. Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits) signal: Guard against negative signal numbers in copy_siginfo_from_user32 signal: Guard against negative signal numbers in copy_siginfo_from_user signal: In sigqueueinfo prefer sig not si_signo signal: Use a smaller struct siginfo in the kernel signal: Distinguish between kernel_siginfo and siginfo signal: Introduce copy_siginfo_from_user and use it's return value signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE signal: Fail sigqueueinfo if si_signo != sig signal/sparc: Move EMT_TAGOVF into the generic siginfo.h signal/unicore32: Use force_sig_fault where appropriate signal/unicore32: Generate siginfo in ucs32_notify_die signal/unicore32: Use send_sig_fault where appropriate signal/arc: Use force_sig_fault where appropriate signal/arc: Push siginfo generation into unhandled_exception signal/ia64: Use force_sig_fault where appropriate signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn signal/ia64: Use the generic force_sigsegv in setup_frame signal/arm/kvm: Use send_sig_mceerr signal/arm: Use send_sig_fault where appropriate signal/arm: Use force_sig_fault where appropriate ...
2018-10-23Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 pti updates from Ingo Molnar: "The main changes: - Make the IBPB barrier more strict and add STIBP support (Jiri Kosina) - Micro-optimize and clean up the entry code (Andy Lutomirski) - ... plus misc other fixes" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Propagate information about RSB filling mitigation to sysfs x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation x86/speculation: Apply IBPB more strictly to avoid cross-process data leak x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant x86/CPU: Fix unused variable warning when !CONFIG_IA32_EMULATION x86/pti/64: Remove the SYSCALL64 entry trampoline x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space x86/entry/64: Document idtentry
2018-10-23Merge branch 'x86-paravirt-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt updates from Ingo Molnar: "Two main changes: - Remove no longer used parts of the paravirt infrastructure and put large quantities of paravirt ops under a new config option PARAVIRT_XXL=y, which is selected by XEN_PV only. (Joergen Gross) - Enable PV spinlocks on Hyperv (Yi Sun)" * 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hyperv: Enable PV qspinlock for Hyper-V x86/hyperv: Add GUEST_IDLE_MSR support x86/paravirt: Clean up native_patch() x86/paravirt: Prevent redefinition of SAVE_FLAGS macro x86/xen: Make xen_reservation_lock static x86/paravirt: Remove unneeded mmu related paravirt ops bits x86/paravirt: Move the Xen-only pv_mmu_ops under the PARAVIRT_XXL umbrella x86/paravirt: Move the pv_irq_ops under the PARAVIRT_XXL umbrella x86/paravirt: Move the Xen-only pv_cpu_ops under the PARAVIRT_XXL umbrella x86/paravirt: Move items in pv_info under PARAVIRT_XXL umbrella x86/paravirt: Introduce new config option PARAVIRT_XXL x86/paravirt: Remove unused paravirt bits x86/paravirt: Use a single ops structure x86/paravirt: Remove clobbers from struct paravirt_patch_site x86/paravirt: Remove clobbers parameter from paravirt patch functions x86/paravirt: Make paravirt_patch_call() and paravirt_patch_jmp() static x86/xen: Add SPDX identifier in arch/x86/xen files x86/xen: Link platform-pci-unplug.o only if CONFIG_XEN_PVHVM x86/xen: Move pv specific parts of arch/x86/xen/mmu.c to mmu_pv.c x86/xen: Move pv irq related functions under CONFIG_XEN_PV umbrella
2018-10-23Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Lots of changes in this cycle: - Lots of CPA (change page attribute) optimizations and related cleanups (Thomas Gleixner, Peter Zijstra) - Make lazy TLB mode even lazier (Rik van Riel) - Fault handler cleanups and improvements (Dave Hansen) - kdump, vmcore: Enable kdumping encrypted memory with AMD SME enabled (Lianbo Jiang) - Clean up VM layout documentation (Baoquan He, Ingo Molnar) - ... plus misc other fixes and enhancements" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry() x86/mm: Kill stray kernel fault handling comment x86/mm: Do not warn about PCI BIOS W+X mappings resource: Clean it up a bit resource: Fix find_next_iomem_res() iteration issue resource: Include resource end in walk_*() interfaces x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error x86/mm: Remove spurious fault pkey check x86/mm/vsyscall: Consider vsyscall page part of user address space x86/mm: Add vsyscall address helper x86/mm: Fix exception table comments x86/mm: Add clarifying comments for user addr space x86/mm: Break out user address space handling x86/mm: Break out kernel address space handling x86/mm: Clarify hardware vs. software "error_code" x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Add freed_tables element to flush_tlb_info x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range smp,cpumask: introduce on_each_cpu_cond_mask smp: use __cpumask_set_cpu in on_each_cpu_cond ...
2018-10-23Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and misc x86 updates from Ingo Molnar: "Lots of changes in this cycle - in part because locking/core attracted a number of related x86 low level work which was easier to handle in a single tree: - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E. McKenney, Andrea Parri) - lockdep scalability improvements and micro-optimizations (Waiman Long) - rwsem improvements (Waiman Long) - spinlock micro-optimization (Matthew Wilcox) - qspinlocks: Provide a liveness guarantee (more fairness) on x86. (Peter Zijlstra) - Add support for relative references in jump tables on arm64, x86 and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens) - Be a lot less permissive on weird (kernel address) uaccess faults on x86: BUG() when uaccess helpers fault on kernel addresses (Jann Horn) - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav Amit) - ... and a handful of other smaller changes as well" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) locking/lockdep: Make global debug_locks* variables read-mostly locking/lockdep: Fix debug_locks off performance problem locking/pvqspinlock: Extend node size when pvqspinlock is configured locking/qspinlock_stat: Count instances of nested lock slowpaths locking/qspinlock, x86: Provide liveness guarantee x86/asm: 'Simplify' GEN_*_RMWcc() macros locking/qspinlock: Rework some comments locking/qspinlock: Re-order code locking/lockdep: Remove duplicated 'lock_class_ops' percpu array x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y futex: Replace spin_is_locked() with lockdep locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs x86/extable: Macrofy inline assembly code to work around GCC inlining bugs x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs x86/refcount: Work around GCC inlining bug x86/objtool: Use asm macros to work around GCC inlining bugs ...
2018-10-23Merge branch 'efi-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The main changes are: - Add support for enlisting the help of the EFI firmware to create memory reservations that persist across kexec. - Add page fault handling to the runtime services support code on x86 so we can more gracefully recover from buggy EFI firmware. - Fix command line handling on x86 for the boot path that omits the stub's PE/COFF entry point. - Other assorted fixes and updates" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: boot: Fix EFI stub alignment efi/x86: Call efi_parse_options() from efi_main() efi/x86: earlyprintk - Add 64bit efi fb address support efi/x86: drop task_lock() from efi_switch_mm() efi/x86: Handle page faults occurring while running EFI runtime services efi: Make efi_rts_work accessible to efi page fault handler efi/efi_test: add exporting ResetSystem runtime service efi/libstub: arm: support building with clang efi: add API to reserve memory persistently across kexec reboot efi/arm: libstub: add a root memreserve config table efi: honour memory reservations passed via a linux specific config table
2018-10-21x86/mm: Kill stray kernel fault handling commentDave Hansen
I originally had matching user and kernel comments, but the kernel one got improved. Some errant conflict resolution kicked the commment somewhere wrong. Kill it. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: aa37c51b94 ("x86/mm: Break out user address space handling") Link: http://lkml.kernel.org/r/20181019140842.12F929FA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-10x86/mm: Do not warn about PCI BIOS W+X mappingsThomas Gleixner
PCI BIOS requires the BIOS area 0x0A0000-0x0FFFFFF to be mapped W+X for various legacy reasons. When CONFIG_DEBUG_WX is enabled, this triggers the WX warning, but this is misleading because the mapping is required and is not a result of an accidental oversight. Prevent the full warning when PCI BIOS is enabled and the detected WX mapping is in the BIOS area. Just emit a pr_warn() which denotes the fact. This is partially duplicating the info which the PCI BIOS code emits when it maps the area as executable, but that info is not in the context of the WX checking output. Remove the extra %p printout in the WARN_ONCE() while at it. %pS is enough. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Borislav Petkov <bp@suse.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: Kees Cook <keescook@chromium.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1810082151160.2455@nanos.tec.linutronix.de
2018-10-09x86/mm: Remove spurious fault pkey checkDave Hansen
Spurious faults only ever occur in the kernel's address space. They are also constrained specifically to faults with one of these error codes: X86_PF_WRITE | X86_PF_PROT X86_PF_INSTR | X86_PF_PROT So, it's never even possible to reach spurious_kernel_fault_check() with X86_PF_PK set. In addition, the kernel's address space never has pages with user-mode protections. Protection Keys are only enforced on pages with user-mode protection. This gives us lots of reasons to not check for protection keys in our sprurious kernel fault handling. But, let's also add some warnings to ensure that these assumptions about protection keys hold true. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160231.243A0D6A@viggo.jf.intel.com
2018-10-09x86/mm/vsyscall: Consider vsyscall page part of user address spaceDave Hansen
The vsyscall page is weird. It is in what is traditionally part of the kernel address space. But, it has user permissions and we handle faults on it like we would on a user page: interrupts on. Right now, we handle vsyscall emulation in the "bad_area" code, which is used for both user-address-space and kernel-address-space faults. Move the handling to the user-address-space code *only* and ensure we get there by "excluding" the vsyscall page from the kernel address space via a check in fault_in_kernel_space(). Since the fault_in_kernel_space() check is used on 32-bit, also add a 64-bit check to make it clear we only use this path on 64-bit. Also move the unlikely() to be in is_vsyscall_vaddr() itself. This helps clean up the kernel fault handling path by removing a case that can happen in normal[1] operation. (Yeah, yeah, we can argue about the vsyscall page being "normal" or not.) This also makes sanity checks easier, like the "we never take pkey faults in the kernel address space" check in the next patch. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160230.6E9336EE@viggo.jf.intel.com
2018-10-09x86/mm: Add vsyscall address helperDave Hansen
We will shortly be using this check in two locations. Put it in a helper before we do so. Let's also insert PAGE_MASK instead of the open-coded ~0xfff. It is easier to read and also more obviously correct considering the implicit type conversion that has to happen when it is not an implicit 'unsigned long'. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160228.C593509B@viggo.jf.intel.com
2018-10-09x86/mm: Fix exception table commentsDave Hansen
The comments here are wrong. They are too absolute about where faults can occur when running in the kernel. The comments are also a bit hard to match up with the code. Trim down the comments, and make them more precise. Also add a comment explaining why we are doing the bad_area_nosemaphore() path here. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160227.077DDD7A@viggo.jf.intel.com
2018-10-09x86/mm: Add clarifying comments for user addr spaceDave Hansen
The SMAP and Reserved checking do not have nice comments. Add some to clarify and make it match everything else. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160225.FFD44B8D@viggo.jf.intel.com
2018-10-09x86/mm: Break out user address space handlingDave Hansen
The last patch broke out kernel address space handing into its own helper. Now, do the same for user address space handling. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160223.9C4F6440@viggo.jf.intel.com
2018-10-09x86/mm: Break out kernel address space handlingDave Hansen
The page fault handler (__do_page_fault()) basically has two sections: one for handling faults in the kernel portion of the address space and another for faults in the user portion of the address space. But, these two parts don't stick out that well. Let's make that more clear from code separation and naming. Pull kernel fault handling into its own helper, and reflect that naming by renaming spurious_fault() -> spurious_kernel_fault(). Also, rewrite the vmalloc() handling comment a bit. It was a bit stale and also glossed over the reserved bit handling. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160222.401F4E10@viggo.jf.intel.com
2018-10-09x86/mm: Clarify hardware vs. software "error_code"Dave Hansen
We pass around a variable called "error_code" all around the page fault code. Sounds simple enough, especially since "error_code" looks like it exactly matches the values that the hardware gives us on the stack to report the page fault error code (PFEC in SDM parlance). But, that's not how it works. For part of the page fault handler, "error_code" does exactly match PFEC. But, during later parts, it diverges and starts to mean something a bit different. Give it two names for its two jobs. The place it diverges is also really screwy. It's only in a spot where the hardware tells us we have kernel-mode access that occurred while we were in usermode accessing user-controlled address space. Add a warning in there. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160220.4A2272C9@viggo.jf.intel.com