summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm/pgtable-book3s64.c
AgeCommit message (Collapse)Author
2019-03-12treewide: add checks for the return value of memblock_alloc*()Mike Rapoport
Add check for the return value of memblock_alloc*() functions and call panic() in case of error. The panic message repeats the one used by panicing memblock allocators with adjustment of parameters to include only relevant ones. The replacement was mostly automated with semantic patches like the one below with manual massaging of format strings. @@ expression ptr, size, align; @@ ptr = memblock_alloc(size, align); + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type] Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc] Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails] Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx [akpm@linux-foundation.org: fix xtensa printk warning] Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: Guo Ren <ren_guo@c-sky.com> [c-sky] Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390] Reviewed-by: Juergen Gross <jgross@suse.com> [Xen] Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Max Filippov <jcmvbkbc@gmail.com> [xtensa] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07powerpc: prefer memblock APIs returning virtual addressMike Rapoport
Patch series "memblock: simplify several early memory allocation", v4. These patches simplify some of the early memory allocations by replacing usage of older memblock APIs with newer and shinier ones. Quite a few places in the arch/ code allocated memory using a memblock API that returns a physical address of the allocated area, then converted this physical address to a virtual one and then used memset(0) to clear the allocated range. More recent memblock APIs do all the three steps in one call and their usage simplifies the code. It's important to note that regardless of API used, the core allocation is nearly identical for any set of memblock allocators: first it tries to find a free memory with all the constraints specified by the caller and then falls back to the allocation with some or all constraints disabled. The first three patches perform the conversion of call sites that have exact requirements for the node and the possible memory range. The fourth patch is a bit one-off as it simplifies openrisc's implementation of pte_alloc_one_kernel(), and not only the memblock usage. The fifth patch takes care of simpler cases when the allocation can be satisfied with a simple call to memblock_alloc(). The sixth patch removes one-liner wrappers for memblock_alloc on arm and unicore32, as suggested by Christoph. This patch (of 6): There are a several places that allocate memory using memblock APIs that return a physical address, convert the returned address to the virtual address and frequently also memset(0) the allocated range. Update these places to use memblock allocators already returning a virtual address. Use memblock functions that clear the allocated memory instead of calling memset(0) where appropriate. The calls to memblock_alloc_base() that were not followed by memset(0) are replaced with memblock_alloc_try_nid_raw(). Since the latter does not panic() when the allocation fails, the appropriate panic() calls are added to the call sites. Link: http://lkml.kernel.org/r/1546248566-14910-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Greentime Hu <green.hu@gmail.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Mark Salter <msalter@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05arch/powerpc/mm: Nest MMU workaround for mprotect RW upgradeAneesh Kumar K.V
NestMMU requires us to mark the pte invalid and flush the tlb when we do a RW upgrade of pte. We fixed a variant of this in the fault path in bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang"). Do the same for mprotect upgrades. Hugetlb is handled in the next patch. Link: http://lkml.kernel.org/r/20190116085035.29729-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-31powerpc/radix: Fix kernel crash with mremap()Aneesh Kumar K.V
With support for split pmd lock, we use pmd page pmd_huge_pte pointer to store the deposited page table. In those config when we move page tables we need to make sure we move the deposited page table to the correct pmd page. Otherwise this can result in crash when we withdraw of deposited page table because we can find the pmd_huge_pte NULL. eg: __split_huge_pmd+0x1070/0x1940 __split_huge_pmd+0xe34/0x1940 (unreliable) vma_adjust_trans_huge+0x110/0x1c0 __vma_adjust+0x2b4/0x9b0 __split_vma+0x1b8/0x280 __do_munmap+0x13c/0x550 sys_mremap+0x220/0x7e0 system_call+0x5c/0x70 Fixes: 675d995297d4 ("powerpc/book3s64: Enable split pmd ptlock.") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-04powerpc/mm: Avoid useless lock with single page fragmentsChristophe Leroy
There is no point in taking the page table lock as pte_frag or pmd_frag are always NULL when we have only one fragment. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-04powerpc/mm: Move pte_fragment_alloc() to a common locationChristophe Leroy
In preparation of next patch which generalises the use of pte_fragment_alloc() for all, this patch moves the related functions in a place that is common to all subarches. The 8xx will need that for supporting 16k pages, as in that mode page tables still have a size of 4k. Since pte_fragment with only once fragment is not different from what is done in the general case, we can easily migrate all subarchs to pte fragments. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: Fix WARN_ON with THP NUMA migrationAneesh Kumar K.V
WARNING: CPU: 12 PID: 4322 at /arch/powerpc/mm/pgtable-book3s64.c:76 set_pmd_at+0x4c/0x2b0 Modules linked in: CPU: 12 PID: 4322 Comm: qemu-system-ppc Tainted: G W 4.19.0-rc3-00758-g8f0c636b0542 #36 NIP: c0000000000872fc LR: c000000000484eec CTR: 0000000000000000 REGS: c000003fba876fe0 TRAP: 0700 Tainted: G W (4.19.0-rc3-00758-g8f0c636b0542) MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24282884 XER: 00000000 CFAR: c000000000484ee8 IRQMASK: 0 GPR00: c000000000484eec c000003fba877268 c000000001f0ec00 c000003fbd229f80 GPR04: 00007c8fe8e00000 c000003f864c5a38 860300853e0000c0 0000000000000080 GPR08: 0000000080000000 0000000000000001 0401000000000080 0000000000000001 GPR12: 0000000000002000 c000003fffff5400 c000003fce292000 00007c9024570000 GPR16: 0000000000000000 0000000000ffffff 0000000000000001 c000000001885950 GPR20: 0000000000000000 001ffffc0004807c 0000000000000008 c000000001f49d05 GPR24: 00007c8fe8e00000 c0000000020f2468 ffffffffffffffff c000003fcd33b090 GPR28: 00007c8fe8e00000 c000003fbd229f80 c000003f864c5a38 860300853e0000c0 NIP [c0000000000872fc] set_pmd_at+0x4c/0x2b0 LR [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20 Call Trace: [c000003fba877268] [c00000000045931c] mpol_misplaced+0x1bc/0x230 (unreliable) [c000003fba8772c8] [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20 [c000003fba877398] [c00000000040d344] __handle_mm_fault+0x5e4/0x2300 [c000003fba8774d8] [c00000000040f400] handle_mm_fault+0x3a0/0x420 [c000003fba877528] [c0000000003ff6f4] __get_user_pages+0x2e4/0x560 [c000003fba877628] [c000000000400314] get_user_pages_unlocked+0x104/0x2a0 [c000003fba8776c8] [c000000000118f44] __gfn_to_pfn_memslot+0x284/0x6a0 [c000003fba877748] [c0000000001463a0] kvmppc_book3s_radix_page_fault+0x360/0x12d0 [c000003fba877838] [c000000000142228] kvmppc_book3s_hv_page_fault+0x48/0x1300 [c000003fba877988] [c00000000013dc08] kvmppc_vcpu_run_hv+0x1808/0x1b50 [c000003fba877af8] [c000000000126b44] kvmppc_vcpu_run+0x34/0x50 [c000003fba877b18] [c000000000123268] kvm_arch_vcpu_ioctl_run+0x288/0x2d0 [c000003fba877b98] [c00000000011253c] kvm_vcpu_ioctl+0x1fc/0x8c0 [c000003fba877d08] [c0000000004e9b24] do_vfs_ioctl+0xa44/0xae0 [c000003fba877db8] [c0000000004e9c44] ksys_ioctl+0x84/0xf0 [c000003fba877e08] [c0000000004e9cd8] sys_ioctl+0x28/0x80 We removed the pte_protnone check earlier with the understanding that we mark the pte invalid before the set_pte/set_pmd usage. But the huge pmd autonuma still use the set_pmd_at directly. This is ok because a protnone pte won't have translation cache in TLB. Fixes: da7ad366b497 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm/book3s: Check for pmd_large instead of pmd_trans_hugeAneesh Kumar K.V
Update few code paths to check for pmd_large. set_pmd_at: We want to use this to store swap pte at pmd level. For swap ptes we don't want to set H_PAGE_THP_HUGE. Hence check for pmd_large in set_pmd_at. This remove the false WARN_ON when using this with swap pmd entry. pmd_page: We don't really use them on pmd migration entries. But they can also work with migration entries and we don't differentiate at the pte level. Hence update pmd_page to work with pmd migration entries too __find_linux_pte: lockless page table walk need to handle pmd migration entries. pmd_trans_huge check will return false on them. We don't set thp = 1 for such entries, but update hpage_shift correctly. Without this we will walk pmd migration entries as a pte page pointer which is wrong. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bitAneesh Kumar K.V
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid pgd/pud/pmd entry. We also switch the p**_present() to look at this bit. With pmd_present, we have a special case. We need to make sure we consider a pmd marked invalid during THP split as present. Right now we clear the _PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit for this special case. pmd_present is also updated to look at _PAGE_INVALID. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-13powerpc/mm/book3s/radix: Add mapping statisticsAneesh Kumar K.V
Add statistics that show how memory is mapped within the kernel linear mapping. This is similar to commit 37cd944c8d8f ("s390/pgtable: add mapping statistics") We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize) to map the kernel linear mapping and we print the linear psize during boot as below. "Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24" A sample output looks like: DirectMap4k: 0 kB DirectMap64k: 18432 kB DirectMap2M: 1030144 kB DirectMap1G: 11534336 kB Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-08powerpc/64s: Fix page table fragment refcount race vs speculative referencesNicholas Piggin
The page table fragment allocator uses the main page refcount racily with respect to speculative references. A customer observed a BUG due to page table page refcount underflow in the fragment allocator. This can be caused by the fragment allocator set_page_count stomping on a speculative reference, and then the speculative failure handler decrements the new reference, and the underflow eventually pops when the page tables are freed. Fix this by using a dedicated field in the struct page for the page table fragment allocator. Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage") Cc: stable@vger.kernel.org # v3.10+ Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-20powerpc/mm/hash/4k: Free hugetlb page table caches correctly.Aneesh Kumar K.V
With 4k page size for hugetlb we allocate hugepage directories from its on slab cache. With patch 0c4d26802 ("powerpc/book3s64/mm: Simplify the rcu callback for page table free") we missed to free these allocated hugepd tables. Update pgtable_free to handle hugetlb hugepd directory table. Fixes: 0c4d268029bf ("powerpc/book3s64/mm: Simplify the rcu callback for page table free") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Add CONFIG_HUGETLB_PAGE guard to fix build break] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-03powerpc/64s/radix: prefetch user address in update_mmu_cacheNicholas Piggin
Prefetch the faulting address in update_mmu_cache to give the page table walker perhaps 100 cycles head start as locks are dropped and the interrupt completed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-03powerpc/mm/radix: Change pte relax sequence to handle nest MMU hangAneesh Kumar K.V
When relaxing access (read -> read_write update), pte needs to be marked invalid to handle a nest MMU bug. We also need to do a tlb flush after the pte is marked invalid before updating the pte with new access bits. We also move tlb flush to platform specific __ptep_set_access_flags. This will help us to gerid of unnecessary tlb flush on BOOK3S 64 later. We don't do that in this patch. This also helps in avoiding multiple tlbies with coprocessor attached. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-03powerpc/mm: Change function prototypeAneesh Kumar K.V
In later patch, we use the vma and psize to do tlb flush. Do the prototype update in separate patch to make the review easy. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm: Use page fragments for allocation page table at PMD levelAneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm: Implement helpers for pagetable fragment support at PMD levelAneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/book3s64/mm: Simplify the rcu callback for page table freeAneesh Kumar K.V
Instead of encoding shift in the table address, use an enumerated index value. This allow us to do different things in the callback for pte and pmd. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm/book3s64/4k: Switch 4k pagesize config to use pagetable fragmentAneesh Kumar K.V
4K config use one full page at level 4 of the pagetable. Add support for single fragment allocation in pagetable fragment code and and use that for 4K config. This makes both 4k and 64k use the same code path. Later we will switch pmd to use the page table fragment code. This is done only for 64bit platforms which is using page table fragment support. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm/nohash: Remove pte fragment dependency from nohashAneesh Kumar K.V
Now that we have removed 64K page size support, the RCU page table free can be much simpler for nohash. Make a copy of the the rcu callback to pgalloc.h header similar to nohash 32. We could possibly merge 32 and 64 bit there. But that is for a later patch We also move the book3s specific handler to pgtable_book3s64.c. This will be updated in a later patch to handle split pmd ptlock. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm: Use pmd_lockptr instead of opencoding itAneesh Kumar K.V
In later patch we switch pmd_lock from mm->page_table_lock to split pmd ptlock. It avoid compilations issues, use pmd_lockptr helper. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm/book3s64: Move book3s64 code to pgtable-book3s64Aneesh Kumar K.V
Only code movement and avoid #ifdef. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-31Merge branch 'topic/paca' into nextMichael Ellerman
Bring in yet another series that touches KVM code, and might need to be merged into the kvm-ppc branch to resolve conflicts. This required some changes in pnv_power9_force_smt4_catch/release() due to the paca array becomming an array of pointers.
2018-03-31powerpc/mm: Pass node id into create_section_mappingNicholas Piggin
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Move __map_kernel_page_nid() inside #ifdef SPARSEMEM_VMEMMAP] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-27powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping()Mauricio Faria de Oliveira
Fix the warning messages for stop_machine_change_mapping(), and a number of other affected functions in its call chain. All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit is okay (keeps them / does not discard them). Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu. $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux ... MODPOST vmlinux.o WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping() The function stop_machine_change_mapping() references the function __meminit create_physical_mapping(). This is often because stop_machine_change_mapping lacks a __meminit annotation or the annotation of create_physical_mapping is wrong. WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping() The function stop_machine_change_mapping() references the function __meminit create_physical_mapping(). This is often because stop_machine_change_mapping lacks a __meminit annotation or the annotation of create_physical_mapping is wrong. ... Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-31powerpc/mm: update pmdp_invalidate to return old pmd valueAneesh Kumar K.V
It's required to avoid losing dirty and accessed bits. Link: http://lkml.kernel.org/r/20171213105756.69879-7-kirill.shutemov@linux.intel.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17powerpc/mm/cxl: Add the fault handling cpu to mm cpumaskAneesh Kumar K.V
We use mm cpumask for serializing against lockless page table walk. Anybody who is doing a lockless page table walk is expected to disable irq and only cpus in mm cpumask is expected do the lockless walk. This ensure that a THP split can send IPI to only cpus in the mm cpumask, to make sure there are no parallel lockless page table walk. Add the CAPI fault handling cpu to the mm cpumask so that we can do the lockless page table walk while inserting hash page table entries. Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17powerpc/mm: Don't send IPI to all cpus on THP updatesAneesh Kumar K.V
Now that we made sure that lockless walk of linux page table is mostly limitted to current task(current->mm->pgdir) we can update the THP update sequence to only send IPI to CPUs on which this task has run. This helps in reducing the IPI overload on systems with large number of CPUs. WRT kvm even though kvm is walking page table with vpc->arch.pgdir, it is done only on secondary CPUs and in that case we have primary CPU added to task's mm cpumask. Sending an IPI to primary will force the secondary to do a vm exit and hence this mm cpumask usage is safe here. WRT CAPI, we still end up walking linux page table with capi context MM. For now the pte lookup serialization sends an IPI to all CPUs in CPI is in use. We can further improve this by adding the CAPI interrupt handling CPU to task mm cpumask. That will be done in a later patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02powerpc/mm: Add devmap support for ppc64Oliver O'Halloran
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S. This is used to differentiate device backed memory from transparent huge pages since they are handled in more or less the same manner by the core mm code. Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-02sched/headers: Prepare to remove the <linux/mm_types.h> dependency from ↵Ingo Molnar
<linux/sched.h> Update code that relied on sched.h including various MM types for them. This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-31powerpc/mm: add radix__remove_section_mapping()Reza Arbab
Tear down and free the four-level page tables of physical mappings during memory hotremove. Borrow the basic structure of remove_pagetable() and friends from the identically-named x86 functions. Reduce the frequency of tlb flushes and page_table_lock spinlocks by only doing them in the outermost function. There was some question as to whether the locking is needed at all. Leave it for now, but we could consider dropping it. Memory must be offline to be removed, thus not in use. So there shouldn't be the sort of concurrent page walking activity here that might prompt us to use RCU. Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31powerpc/mm: add radix__create_section_mapping()Reza Arbab
Wire up memory hotplug page mapping for radix. Share the mapping function already used by radix_init_pgtable(). Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-17powerpc/mm: Fix memory hotplug BUG() on radixReza Arbab
Memory hotplug is leading to hash page table calls, even on radix: arch_add_memory create_section_mapping htab_bolt_mapping BUG_ON(!ppc_md.hpte_insert); To fix, refactor {create,remove}_section_mapping() into hash__ and radix__ variants. Leave the radix versions stubbed for now. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-28powerpc/mm: update radix__ptep_set_access_flag to not do full mm tlb flushAneesh Kumar K.V
When we are updating a pte, we just need to flush the tlb mapping that pte. Right now we do a full mm flush because we don't track the page size. Now that we have page size details in pte use that to do the optimized flush Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-23powerpc/64/kexec: Fix MMU cleanup on radixBenjamin Herrenschmidt
Just using the hash ops won't work anymore since radix will have NULL in there. Instead create an mmu_cleanup_all() function which will do the right thing based on the MMU mode. For Radix, for now I clear UPRT and the PTCR, effectively switching back to Radix with no partition table setup. Currently set it to NULL on BookE thought it might be a good idea to wipe the TLB there (Scott ?) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13powerpc/mm/radix: Use different pte update sequence for different POWER9 revsAneesh Kumar K.V
POWER9 DD1 requires pte to be marked invalid (V=0) before updating it with the new value. This makes this distinction for the different revisions. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-04powerpc/mm: Move register_process_table() out of ppc_mdMichael Ellerman
We want to initialise register_process_table() before ppc_md is setup, so that it can be called as part of MMU init (at least on Radix ATM). That no longer works because probe_machine() requires that ppc_md be empty before it's called, and we now do probe_machine() much later. So make register_process_table a global for now. It will probably move into a mmu_radix_ops struct at some point in the future. This was broken by me when applying commit 7025776ed1eb "powerpc/mm: Move hash table ops to a separate structure" due to conflicts with other patches. Fixes: 7025776ed1eb ("powerpc/mm: Move hash table ops to a separate structure") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01powerpc/mm/radix: Add tlb flush of THP ptesAneesh Kumar K.V
Instead of flushing the entire mm, implement a flush_pmd_tlb_range Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-01powerpc/mm/radix: Add missing tlb flushAneesh Kumar K.V
This should not have any impact on hash, because hash does tlb invalidate with every pte update and we don't implement flush_tlb_* functions for hash. With radix we should make an explicit call to flush tlb outside pte update. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/radix: Add radix THP callbacksAneesh Kumar K.V
The deposited pgtable_t is a pte fragment hence we cannot use page->lru for linking then together. We use the first two 64 bits for pte fragment as list_head type to link all deposited fragments together. On withdraw we properly zero then out. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/thp: Abstraction for THP functionsAneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>