summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm
AgeCommit message (Collapse)Author
2023-10-27Merge tag 'powerpc-6.6-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix boot crash with FLATMEM since set_ptes() introduction - Avoid calling arch_enter/leave_lazy_mmu() in set_ptes() Thanks to Aneesh Kumar K.V and Erhard Furtner. * tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes powerpc/mm: Fix boot crash with FLATMEM
2023-10-25powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptesAneesh Kumar K.V
With commit 9fee28baa601 ("powerpc: implement the new page table range API") we added set_ptes to powerpc architecture. The implementation included calling arch_enter/leave_lazy_mmu() calls. The patch removes the usage of arch_enter/leave_lazy_mmu() because set_pte is not supposed to be used when updating a pte entry. Powerpc architecture uses this rule to skip the expensive tlb invalidate which is not needed when you are setting up the pte for the first time. See commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit") for more details The patch also makes sure we are not using the interface to update a valid/present pte entry by adding VM_WARN_ON check all the ptes we are setting up. Furthermore, we add a comment to set_pte_filter to clarify it can only update folio-related flags and cannot filter pfn specific details in pte filtering. Removal of arch_enter/leave_lazy_mmu() also will avoid nesting of these functions that are not supported. For ex: remap_pte_range() -> arch_enter_lazy_mmu() -> set_ptes() -> arch_enter_lazy_mmu() -> arch_leave_lazy_mmu() -> arch_leave_lazy_mmu() Fixes: 9fee28baa601 ("powerpc: implement the new page table range API") Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231024143604.16749-1-aneesh.kumar@linux.ibm.com
2023-10-23powerpc/mm: Fix boot crash with FLATMEMMichael Ellerman
Erhard reported that his G5 was crashing with v6.6-rc kernels: mpic: Setting up HT PICs workarounds for U3/U4 BUG: Unable to handle kernel data access at 0xfeffbb62ffec65fe Faulting instruction address: 0xc00000000005dc40 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G T 6.6.0-rc3-PMacGS #1 Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac NIP: c00000000005dc40 LR: c000000000066660 CTR: c000000000007730 REGS: c0000000022bf510 TRAP: 0380 Tainted: G T (6.6.0-rc3-PMacGS) MSR: 9000000000001032 <SF,HV,ME,IR,DR,RI> CR: 44004242 XER: 00000000 IRQMASK: 3 GPR00: 0000000000000000 c0000000022bf7b0 c0000000010c0b00 00000000000001ac GPR04: 0000000003c80000 0000000000000300 c0000000f20001ae 0000000000000300 GPR08: 0000000000000006 feffbb62ffec65ff 0000000000000001 0000000000000000 GPR12: 9000000000001032 c000000002362000 c000000000f76b80 000000000349ecd8 GPR16: 0000000002367ba8 0000000002367f08 0000000000000006 0000000000000000 GPR20: 00000000000001ac c000000000f6f920 c0000000022cd985 000000000000000c GPR24: 0000000000000300 00000003b0a3691d c0003e008030000e 0000000000000000 GPR28: c00000000000000c c0000000f20001ee feffbb62ffec65fe 00000000000001ac NIP hash_page_do_lazy_icache+0x50/0x100 LR __hash_page_4K+0x420/0x590 Call Trace: hash_page_mm+0x364/0x6f0 do_hash_fault+0x114/0x2b0 data_access_common_virt+0x198/0x1f0 --- interrupt: 300 at mpic_init+0x4bc/0x10c4 NIP: c000000002020a5c LR: c000000002020a04 CTR: 0000000000000000 REGS: c0000000022bf9f0 TRAP: 0300 Tainted: G T (6.6.0-rc3-PMacGS) MSR: 9000000000001032 <SF,HV,ME,IR,DR,RI> CR: 24004248 XER: 00000000 DAR: c0003e008030000e DSISR: 40000000 IRQMASK: 1 ... NIP mpic_init+0x4bc/0x10c4 LR mpic_init+0x464/0x10c4 --- interrupt: 300 pmac_setup_one_mpic+0x258/0x2dc pmac_pic_init+0x28c/0x3d8 init_IRQ+0x90/0x140 start_kernel+0x57c/0x78c start_here_common+0x1c/0x20 A bisect pointed to the breakage beginning with commit 9fee28baa601 ("powerpc: implement the new page table range API"). Analysis of the oops pointed to a struct page with a corrupted compound_head being loaded via page_folio() -> _compound_head() in hash_page_do_lazy_icache(). The access by the mpic code is to an MMIO address, so the expectation is that the struct page for that address would be initialised by init_unavailable_range(), as pointed out by Aneesh. Instrumentation showed that was not the case, which eventually lead to the realisation that pfn_valid() was returning false for that address, causing the struct page to not be initialised. Because the system is using FLATMEM, the version of pfn_valid() in memory_model.h is used: static inline int pfn_valid(unsigned long pfn) { ... return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr; } Which relies on max_mapnr being initialised. Early in boot max_mapnr is zero meaning no PFNs are valid. max_mapnr is initialised in mem_init() called via: start_kernel() mm_core_init() # init/main.c:928 mem_init() But that is too late for the usage in init_unavailable_range() called via: start_kernel() setup_arch() # init/main.c:893 paging_init() free_area_init() init_unavailable_range() Although max_mapnr is currently set in mem_init(), the value is actually already available much earlier, as soon as mem_topology_setup() has completed, which is also before paging_init() is called. So move the initialisation there, which causes paging_init() to correctly initialise the struct page and fixes the bug. This bug seems to have been lurking for years, but went unnoticed because the pre-folio code was inspecting the uninitialised page->flags but not dereferencing it. Thanks to Erhard and Aneesh for help debugging. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/20230929132750.3cd98452@yea/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231023112500.1550208-1-mpe@ellerman.id.au
2023-10-21Merge tag 'powerpc-6.6-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix stale propagated yield_cpu in qspinlocks leading to lockups - Fix broken hugepages on some configs due to ARCH_FORCE_MAX_ORDER - Fix a spurious warning when copros are in use at exit time Thanks to Nicholas Piggin, Christophe Leroy, Nysal Jan K.A Sachin Sant, and Shrikanth Hegde. * tag 'powerpc-6.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/qspinlock: Fix stale propagated yield_cpu powerpc/64s/radix: Don't warn on copros in radix__tlb_flush() powerpc/mm: Allow ARCH_FORCE_MAX_ORDER up to 12
2023-10-18powerpc/64s/radix: Don't warn on copros in radix__tlb_flush()Michael Ellerman
Sachin reported a warning when running the inject-ra-err selftest: # selftests: powerpc/mce: inject-ra-err Disabling lock debugging due to kernel taint MCE: CPU19: machine check (Severe) Real address Load/Store (foreign/control memory) [Not recovered] MCE: CPU19: PID: 5254 Comm: inject-ra-err NIP: [0000000010000e48] MCE: CPU19: Initiator CPU MCE: CPU19: Unknown ------------[ cut here ]------------ WARNING: CPU: 19 PID: 5254 at arch/powerpc/mm/book3s64/radix_tlb.c:1221 radix__tlb_flush+0x160/0x180 CPU: 19 PID: 5254 Comm: inject-ra-err Kdump: loaded Tainted: G M E 6.6.0-rc3-00055-g9ed22ae6be81 #4 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries ... NIP radix__tlb_flush+0x160/0x180 LR radix__tlb_flush+0x104/0x180 Call Trace: radix__tlb_flush+0xf4/0x180 (unreliable) tlb_finish_mmu+0x15c/0x1e0 exit_mmap+0x1a0/0x510 __mmput+0x60/0x1e0 exit_mm+0xdc/0x170 do_exit+0x2bc/0x5a0 do_group_exit+0x4c/0xc0 sys_exit_group+0x28/0x30 system_call_exception+0x138/0x330 system_call_vectored_common+0x15c/0x2ec And bisected it to commit e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs"), which added a warning in radix__tlb_flush() if mm->context.copros is still elevated. However it's possible for the copros count to be elevated if a process exits without first closing file descriptors that are associated with a copro, eg. VAS. If the process exits with a VAS file still open, the release callback is queued up for exit_task_work() via: exit_files() put_files_struct() close_files() filp_close() fput() And called via: exit_task_work() ____fput() __fput() file->f_op->release(inode, file) coproc_release() vas_user_win_ops->close_win() vas_deallocate_window() mm_context_remove_vas_window() mm_context_remove_copro() But that is after exit_mm() has been called from do_exit() and triggered the warning. Fix it by dropping the warning, and always calling __flush_all_mm(). In the normal case of no copros, that will result in a call to _tlbiel_pid(mm->context.id, RIC_FLUSH_ALL) just as the current code does. If the copros count is elevated then it will cause a global flush, which should flush translations from any copros. Note that the process table entry was cleared in arch_exit_mmap(), so copros should not be able to fetch any new translations. Fixes: e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Closes: https://lore.kernel.org/all/A8E52547-4BF1-47CE-8AEA-BC5A9D7E3567@linux.ibm.com/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://msgid.link/20231017121527.1574104-1-mpe@ellerman.id.au
2023-09-29mm: hugetlb: add huge page size param to set_huge_pte_at()Ryan Roberts
Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it will dereference a bad pointer in page_folio(): static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); return page_folio(pfn_to_page(swp_offset_pfn(entry))); } Fix === The simplest fix would have been to revert the dodgy cleanup commit 18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on, this would have required an audit of all the new set_huge_pte_at() call sites to see if they should be converted to set_huge_swap_pte_at(). As per the original intent of the change, it would also leave us open to future bugs when people invariably get it wrong and call the wrong helper. So instead, I've added a huge page size parameter to set_huge_pte_at(). This means that the arm64 code has the size in all cases. It's a bigger change, due to needing to touch the arches that implement the function, but it is entirely mechanical, so in my view, low risk. I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390, sparc (and additionally x86_64). I've additionally booted and run mm selftests against arm64, where I observe the uffd poison test is fixed, and there are no other regressions. This patch (of 2): In order to fix a bug, arm64 needs to be told the size of the huge page for which the pte is being set in set_huge_pte_at(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. No behavioral changes intended. Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-31Merge tag 'powerpc-6.6-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the configured SMT state when hotplugging CPUs into the system - Combine final TLB flush and lazy TLB mm shootdown IPIs when using the Radix MMU to avoid a broadcast TLBIE flush on exit - Drop the exclusion between ptrace/perf watchpoints, and drop the now unused associated arch hooks - Add support for the "nohlt" command line option to disable CPU idle - Add support for -fpatchable-function-entry for ftrace, with GCC >= 13.1 - Rework memory block size determination, and support 256MB size on systems with GPUs that have hotpluggable memory - Various other small features and fixes Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar, Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain, Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai. * tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits) macintosh/ams: linux/platform_device.h is needed powerpc/xmon: Reapply "Relax frame size for clang" powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled powerpc/iommu: Fix notifiers being shared by PCI and VIO buses powerpc/mpc5xxx: Add missing fwnode_handle_put() powerpc/config: Disable SLAB_DEBUG_ON in skiroot powerpc/pseries: Remove unused hcall tracing instruction powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n powerpc: dts: add missing space before { powerpc/eeh: Use pci_dev_id() to simplify the code powerpc/64s: Move CPU -mtune options into Kconfig powerpc/powermac: Fix unused function warning powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT powerpc: Don't include lppaca.h in paca.h powerpc/pseries: Move hcall_vphn() prototype into vphn.h powerpc/pseries: Move VPHN constants into vphn.h cxl: Drop unused detach_spa() powerpc: Drop zalloc_maybe_bootmem() powerpc/powernv: Use struct opal_prd_msg in more places ...
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-28powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory ↵Aneesh Kumar K.V
attached Commit 4d15721177d5 ("powerpc/mm: Cleanup memory block size probing") used 256MB as the memory block size when we have ibm,coherent-device-memory device tree node present. Instead of returning with 256MB memory block size, continue to check the rest of the memory regions and make sure we can still map them using a 256MB memory block size. Fixes: 4d15721177d5 ("powerpc/mm: Cleanup memory block size probing") Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230828074658.59553-2-aneesh.kumar@linux.ibm.com
2023-08-28powerpc/mm/book3s64: Fix build error with SPARSEMEM disabledAneesh Kumar K.V
With CONFIG_SPARSEMEM disabled the below kernel build error is observed. arch/powerpc/mm/init_64.c:477:38: error: use of undeclared identifier 'SECTION_SIZE_BITS' CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM and it is more clear to describe the code dependency in terms of MEMORY_HOTPLUG. Outside memory hotplug the kernel uses memory_block_size for kernel directmap. Instead of depending on SECTION_SIZE_BITS to compute the direct map page size, add a new #define which defaults to 16M(same as existing SECTION_SIZE) Fixes: 4d15721177d5 ("powerpc/mm: Cleanup memory block size probing") Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308251532.k9PpWEAD-lkp@intel.com/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230828074658.59553-1-aneesh.kumar@linux.ibm.com
2023-08-25Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues or aren't considered suitable for a -stable backport" * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: shmem: fix smaps BUG sleeping while atomic selftests: cachestat: catch failing fsync test on tmpfs selftests: cachestat: test for cachestat availability maple_tree: disable mas_wr_append() when other readers are possible madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check mm: multi-gen LRU: don't spin during memcg release mm: memory-failure: fix unexpected return value in soft_offline_page() radix tree: remove unused variable mm: add a call to flush_cache_vmap() in vmap_pfn() selftests/mm: FOLL_LONGTERM need to be updated to 0x100 nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers() mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast selftests: cgroup: fix test_kmem_basic less than error mm: enable page walking API to lock vmas during the walk smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
2023-08-24powerpc: implement the new page table range APIMatthew Wilcox (Oracle)
Add set_ptes(), update_mmu_cache_range() and flush_dcache_folio(). Change the PG_arch_1 (aka PG_dcache_dirty) flag from being per-page to per-folio. [willy@infradead.org: re-export flush_dcache_icache_folio()] Link: https://lkml.kernel.org/r/ZMx1daYwvD9EM7Cv@casper.infradead.org Link: https://lkml.kernel.org/r/20230802151406.3735276-22-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETEDSuren Baghdasaryan
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means mmap_lock has been released. However with per-VMA locks behavior is different and the caller should still release it. To make the rules consistent for the caller, drop the per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns VM_FAULT_COMPLETED for now. [willy@infradead.org: fix riscv] Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24powerpc: Don't include lppaca.h in paca.hMichael Ellerman
By adding a forward declaration for struct lppaca we can untangle paca.h and lppaca.h. Also move get_lppaca() into lppaca.h for consistency. Add includes of lppaca.h to some files that need it. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230823055317.751786-3-mpe@ellerman.id.au
2023-08-24powerpc/pseries: Move VPHN constants into vphn.hMichael Ellerman
These don't have any particularly good reason to belong in lppaca.h, move them into their own header. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230823055317.751786-1-mpe@ellerman.id.au
2023-08-21merge mm-hotfixes-stable into mm-stable to pick up depended-upon changesAndrew Morton
2023-08-21powerpc: convert various functions to use ptdescsVishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various functions to use ptdescs. Link: https://lkml.kernel.org/r/20230807230513.102486-13-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/book3s64/radix: add debug message to give more details of vmemmap ↵Aneesh Kumar K.V
allocation Add some extra vmemmap pr_debug message that will indicate the type of vmemmap allocations. For ex: with DAX vmemmap optimization we can find the below details: [ 187.166580] radix-mmu: PAGE_SIZE vmemmap mapping [ 187.166587] radix-mmu: PAGE_SIZE vmemmap mapping [ 187.166591] radix-mmu: Tail page reuse vmemmap mapping [ 187.166594] radix-mmu: Tail page reuse vmemmap mapping [ 187.166598] radix-mmu: Tail page reuse vmemmap mapping [ 187.166601] radix-mmu: Tail page reuse vmemmap mapping [ 187.166604] radix-mmu: Tail page reuse vmemmap mapping [ 187.166608] radix-mmu: Tail page reuse vmemmap mapping [ 187.166611] radix-mmu: Tail page reuse vmemmap mapping [ 187.166614] radix-mmu: Tail page reuse vmemmap mapping [ 187.166617] radix-mmu: Tail page reuse vmemmap mapping [ 187.166620] radix-mmu: Tail page reuse vmemmap mapping [ 187.166623] radix-mmu: Tail page reuse vmemmap mapping [ 187.166626] radix-mmu: Tail page reuse vmemmap mapping [ 187.166629] radix-mmu: Tail page reuse vmemmap mapping [ 187.166632] radix-mmu: Tail page reuse vmemmap mapping And without vmemmap optimization [ 293.549931] radix-mmu: PMD_SIZE vmemmap mapping [ 293.549984] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550032] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550076] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550117] radix-mmu: PMD_SIZE vmemmap mapping Link: https://lkml.kernel.org/r/20230724190759.483013-14-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/book3s64/radix: remove mmu_vmemmap_psizeAneesh Kumar K.V
This is not used by radix anymore. [aneesh.kumar@linux.ibm.com: fix kernel build error] Link: https://lkml.kernel.org/r/874jlowd0c.fsf@linux.ibm.com Link: https://lkml.kernel.org/r/20230724190759.483013-13-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/book3s64/radix: add support for vmemmap optimization for radixAneesh Kumar K.V
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence with 64K page size, we don't use vmemmap deduplication for PMD-level mapping. [aneesh.kumar@linux.ibm.com: ppc64: don't include radix headers if CONFIG_PPC_RADIX_MMU=n] Link: https://lkml.kernel.org/r/87zg3jw8km.fsf@linux.ibm.com Link: https://lkml.kernel.org/r/20230724190759.483013-12-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/book3s64/vmemmap: switch radix to use a different vmemmap handling ↵Aneesh Kumar K.V
function This is in preparation to update radix to implement vmemmap optimization for devdax. Below are the rules w.r.t radix vmemmap mapping 1. First try to map things using PMD (2M) 2. With altmap if altmap cross-boundary check returns true, fall back to PAGE_SIZE 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to PAGE_SIZE On removing vmemmap mapping, check if every subsection that is using the vmemmap area is invalid. If found to be invalid, that implies we can safely free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because with 64K page size, we need to do the above check even at the PAGE_SIZE granularity. [aneesh.kumar@linux.ibm.com: fix section mismatch warning] Link: https://lkml.kernel.org/r/87h6pqvu5g.fsf@linux.ibm.com [aneesh.kumar@linux.ibm.com: fix kernel build error] Link: https://lkml.kernel.org/r/877cqkwd20.fsf@linux.ibm.com Link: https://lkml.kernel.org/r/20230724190759.483013-11-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/book3s64/mm: enable transparent pud hugepageAneesh Kumar K.V
This is enabled only with radix translation and 1G hugepage size. This will be used with devdax device memory with a namespace alignment of 1G. Anon transparent hugepage is not supported even though we do have helpers checking pud_trans_huge(). We should never find that return true. The only expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP. Some of the helpers are never expected to get called on hash translation and hence is marked to call BUG() in such a case. Link: https://lkml.kernel.org/r/20230724190759.483013-10-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/mm/trace: convert trace event to trace event classAneesh Kumar K.V
A follow-up patch will add a pud variant for this same event. Using event class makes that addition simpler. No functional change in this patch. Link: https://lkml.kernel.org/r/20230724190759.483013-9-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: remove CONFIG_PER_VMA_LOCK ifdefsMatthew Wilcox (Oracle)
Patch series "Handle most file-backed faults under the VMA lock", v3. This patchset adds the ability to handle page faults on parts of files which are already in the page cache without taking the mmap lock. This patch (of 10): Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to eliminate ifdefs in the users. Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mmu_notifiers: rename invalidate_range notifierAlistair Popple
There are two main use cases for mmu notifiers. One is by KVM which uses mmu_notifier_invalidate_range_start()/end() to manage a software TLB. The other is to manage hardware TLBs which need to use the invalidate_range() callback because HW can establish new TLB entries at any time. Hence using start/end() can lead to memory corruption as these callbacks happen too soon/late during page unmap. mmu notifier users should therefore either use the start()/end() callbacks or the invalidate_range() callbacks. To make this usage clearer rename the invalidate_range() callback to arch_invalidate_secondary_tlbs() and update documention. Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mmu_notifiers: call invalidate_range() when invalidating TLBsAlistair Popple
The invalidate_range() is going to become an architecture specific mmu notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync with the CPU page tables. Currently it is called from separate code paths to the main CPU TLB invalidations. This can lead to a secondary TLB not getting invalidated when required and makes it hard to reason about when exactly the secondary TLB is invalidated. To fix this move the notifier call to the architecture specific TLB maintenance functions for architectures that have secondary MMUs requiring explicit software invalidations. This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades require a TLB invalidation. This invalidation is done by the architecture specific ptep_set_access_flags() which calls flush_tlb_page() if required. However this doesn't call the notifier resulting in infinite faults being generated by devices using the SMMU if it has previously cached a read-only PTE in it's TLB. Moving the invalidations into the TLB invalidation functions ensures all invalidations happen at the same time as the CPU invalidation. The architecture specific flush_tlb_all() routines do not call the notifier as none of the IOMMUs require this. Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Tested-by: SeongJae Park <sj@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc: mm: convert to GENERIC_IOREMAPChristophe Leroy
By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(), generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and iounmap() are all visible and available to arch. Arch needs to provide wrapper functions to override the generic versions if there's arch specific handling in its ioremap_prot(), ioremap() or iounmap(). This change will simplify implementation by removing duplicated code with generic_ioremap_prot() and generic_iounmap(), and has the equivalent functioality as before. Here, add wrapper functions ioremap_prot() and iounmap() for powerpc's special operation when ioremap() and iounmap(). Link: https://lkml.kernel.org/r/20230706154520.11257-18-bhe@redhat.com Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc: add pte_free_defer() for pgtables sharing pageHugh Dickins
Add powerpc-specific pte_free_defer(), to free table page via call_rcu(). pte_free_defer() will be called inside khugepaged's retract_page_tables() loop, where allocating extra memory cannot be relied upon. This precedes the generic version to avoid build breakage from incompatible pgtable_t. This is awkward because the struct page contains only one rcu_head, but that page may be shared between PTE_FRAG_NR pagetables, each wanting to use the rcu_head at the same time. But powerpc never reuses a fragment once it has been freed: so mark the page Active in pte_free_defer(), before calling pte_fragment_free() directly; and there call_rcu() to pte_free_now() when last fragment is freed and the page is PageActive. Link: https://lkml.kernel.org/r/6e3ca5f1-334d-4b14-b92d-fc8e99914fcb@google.com Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc: assert_pte_locked() use pte_offset_map_nolock()Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock() in assert_pte_locked(). BUG if pte_offset_map_nolock() fails. This mod might cause new crashes: which either expose my ignorance, or indicate issues to be fixed, or limit the usage of assert_pte_locked(). [hughd@google.com: assert_pte_locked() still needs the pmd_none() check] Link: https://lkml.kernel.org/r/c73d1543-532c-3da2-8cf2-a95363a14116@google.com Link: https://lkml.kernel.org/r/e8d56c95-c132-a82e-5f5f-7bb1b738b057@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18powerpc/mm: Cleanup memory block size probingAneesh Kumar K.V
Parse the device tree in early init to find the memory block size to be used by the kernel. Consolidate the memory block size device tree parsing to one helper and use that on both powernv and pseries. We still want to use machine-specific callback because on all machine types other than powernv and pseries we continue to return MIN_MEMORY_BLOCK_SIZE. pseries_memory_block_size used to look for the second memory block (memory@x) to determine the memory_block_size value. This patch changed that to look at all memory blocks and make sure we can map them all correctly using the computed memory block size value. Add workaround to force 256MB memory block size if device driver managed memory such as GPU memory is present. This helps to add GPU memory that is not aligned to 1G. Co-developed-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230801044447.11275-1-aneesh.kumar@linux.ibm.com
2023-08-18powerpc/47x: Add prototype for mmu_init_secondary()Christophe Leroy
A W=1 build of 44x/iss476-smp_defconfig gives: arch/powerpc/mm/nohash/44x.c:220:13: error: no previous prototype for 'mmu_init_secondary' [-Werror=missing-prototypes] 220 | void __init mmu_init_secondary(int cpu) | ^~~~~~~~~~~~~~~~~~ That function is called from head_4xx.S Add a prototype in mmu_decl.h Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/e89d9927c926044e54fd056a849785f526c6414f.1692282340.git.christophe.leroy@csgroup.eu
2023-08-18powerpc/47x: Remove early_init_mmu_47x() to fix no previous prototypeChristophe Leroy
4xx/iss476-smp_defconfig leads to: CC arch/powerpc/mm/nohash/tlb.o arch/powerpc/mm/nohash/tlb.c:322:13: error: no previous prototype for 'early_init_mmu_47x' [-Werror=missing-prototypes] 322 | void __init early_init_mmu_47x(void) | ^~~~~~~~~~~~~~~~~~ early_init_mmu_47x() is used only at one place and only locally. Fold it into its only caller and remove it. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/0a667b7c2e05d3cf41ecd38f33cc334083a61c8d.1692282396.git.christophe.leroy@csgroup.eu
2023-08-16powerpc: replace #include <asm/export.h> with #include <linux/export.h>Masahiro Yamada
Commit ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost") deprecated <asm/export.h>, which is now a wrapper of <linux/export.h>. Replace #include <asm/export.h> with #include <linux/export.h>. After all the <asm/export.h> lines are converted, <asm/export.h> and <asm-generic/export.h> will be removed. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> [mpe: Fixup selftests that stub asm/export.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230806150954.394189-2-masahiroy@kernel.org
2023-08-14powerpc/radix: Move some functions into #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLEChristophe Leroy
With skiboot_defconfig, Clang reports: CC arch/powerpc/mm/book3s64/radix_tlb.o arch/powerpc/mm/book3s64/radix_tlb.c:419:20: error: unused function '_tlbie_pid_lpid' [-Werror,-Wunused-function] static inline void _tlbie_pid_lpid(unsigned long pid, unsigned long lpid, ^ arch/powerpc/mm/book3s64/radix_tlb.c:663:20: error: unused function '_tlbie_va_range_lpid' [-Werror,-Wunused-function] static inline void _tlbie_va_range_lpid(unsigned long start, unsigned long end, ^ This is because those functions are only called from functions enclosed in a #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE Move below functions inside that #ifdef * __tlbie_pid_lpid(unsigned long pid, * __tlbie_va_lpid(unsigned long va, unsigned long pid, * fixup_tlbie_pid_lpid(unsigned long pid, unsigned long lpid) * _tlbie_pid_lpid(unsigned long pid, unsigned long lpid, * fixup_tlbie_va_range_lpid(unsigned long va, * __tlbie_va_range_lpid(unsigned long start, unsigned long end, * _tlbie_va_range_lpid(unsigned long start, unsigned long end, Fixes: f0c6fbbb9050 ("KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202307260802.Mjr99P5O-lkp@intel.com/ Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/3d72efd39f986ee939d068af69fdce28bd600766.1691568093.git.christophe.leroy@csgroup.eu
2023-08-02powerpc: address missing-prototypes warningsArnd Bergmann
There are a few warnings in powerpc64 defconfig builds after -Wmissing-prototypes gets promoted from W=1 to the default warning set: arch/powerpc/mm/book3s64/pgtable.c:422:6: error: no previous prototype for 'arch_report_meminfo' [-Werror=missing-prototypes] arch/powerpc/platforms/cell/ras.c:275:5: error: no previous prototype for 'cbe_sysreset_hack' [-Werror=missing-prototypes] arch/powerpc/platforms/cell/spu_manage.c:29:21: error: no previous prototype for 'spu_devnode' [-Werror=missing-prototypes] arch/powerpc/platforms/pasemi/time.c:12:17: error: no previous prototype for 'pas_get_boot_time' [-Werror=missing-prototypes] arch/powerpc/platforms/powermac/feature.c:1532:13: error: no previous prototype for 'g5_phy_disable_cpu1' [-Werror=missing-prototypes] arch/powerpc/platforms/86xx/pic.c:28:13: error: no previous prototype for 'mpc86xx_init_irq' [-Werror=missing-prototypes] drivers/pci/pci-sysfs.c:936:13: error: no previous prototype for 'pci_adjust_legacy_attr' [-Werror=missing-prototypes] Address these by including the right header files or marking the functions static. The audit.c one is a bit tricky since compat_audit.h cannot include regular kernel headers tht have conflicting types on 32-bit powerpc. Signed-off-by: Arnd Bergmann <arnd@arndb.de> [mpe: Drop change to __vmemmap_free() which only exists in mm] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230727122720.2558065-1-arnd@kernel.org
2023-08-02powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIsNicholas Piggin
This performs lazy tlb mm shootdown when doing the exit TLB flush when all mm users go away and user mappings are removed, which avoids having to do the lazy tlb mm shootdown IPIs on the final mmput when all kernel references disappear. powerpc/64s uses a broadcast TLBIE for the exit TLB flush if remote CPUs need to be invalidated (unless TLBIE is disabled), so this doesn't necessarily save IPIs but it does avoid a broadcast TLBIE which is quite expensive. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Squash in preempt_disable/enable() fix from Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230524060821.148015-5-npiggin@gmail.com
2023-08-02powerpc: Add mm_cpumask warning when context switchingNicholas Piggin
When context switching away from an mm, add a CONFIG_DEBUG_VM warning check to ensure this CPU is still set in the mask. This could catch bugs where the mask is improperly trimmed while the CPU is still using the mm. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230524060821.148015-4-npiggin@gmail.com
2023-08-02powerpc/64s: Use dec_mm_active_cpus helperNicholas Piggin
Avoid open-coded atomic_dec on mm->context.active_cpus and use the function made for it. Add CONFIG_DEBUG_VM underflow checking on the counter. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230524060821.148015-3-npiggin@gmail.com
2023-08-02powerpc: Account mm_cpumask and active_cpus in init_mmNicholas Piggin
init_mm mm_cpumask and context.active_cpus is not maintained at boot and hotplug. This seems to be harmless because init_mm does not have a userspace and so never gets user TLBs flushed, but it looks odd and it prevents some sanity checks being added. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230524060821.148015-2-npiggin@gmail.com
2023-08-02powerpc/kuap: Use ASM feature fixups instead of static branchesChristophe Leroy
To avoid a useless nop on top of every uaccess enable/disable and make life easier for objtool, replace static branches by ASM feature fixups that will nop KUAP enabling instructions out in the unlikely case KUAP is disabled at boottime. Leave it as is on book3s/64 for now, it will be handled later when objtool is activated on PPC64. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/671948788024fd890ec4ed175bc332dab8664ea5.1689091022.git.christophe.leroy@csgroup.eu
2023-08-02powerpc/kuap: Simplify KUAP lock/unlock on BOOK3S/32Christophe Leroy
On book3s/32 KUAP is performed at segment level. At the moment, when enabling userspace access, only current segment is modified. Then if a write is performed on another user segment, a fault is taken and all other user segments get enabled for userspace access. This then require special attention when disabling userspace access. Having a userspace write access crossing a segment boundary is unlikely. Having a userspace write access crossing a segment boundary back and forth is even more unlikely. So, instead of enabling userspace access on all segments when a write fault occurs, just change which segment has userspace access enabled in order to eliminate the case when more than one segment has userspace access enabled. That simplifies userspace access deactivation. There is however a corner case which is even more unlikely but has to be handled anyway: an unaligned access which is crossing a segment boundary. That would definitely require at least having userspace access enabled on the two segments. To avoid complicating the likely case for a so unlikely happening, handle such situation like an alignment exception and emulate the store. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/8de8580513c1a6e880bad1ba9a69d3efad3d4fa5.1689091022.git.christophe.leroy@csgroup.eu
2023-08-02powerpc/kuap: Use MMU_FTR_KUAP on all and refactor disabling kuapChristophe Leroy
All but book3s/64 use a static branch key for disabling kuap. book3s/64 uses an mmu feature. Refactor all targets to use MMU_FTR_KUAP like book3s/64. For PPC32 that implies updating mmu features fixups once KUAP has been initialised. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/6b3d7c977bad73378ea368bc6818e9c94ea95ab0.1689091022.git.christophe.leroy@csgroup.eu
2023-08-02powerpc/kuap: MMU_FTR_BOOK3S_KUAP becomes MMU_FTR_KUAPChristophe Leroy
In order to reuse MMU_FTR_BOOK3S_KUAP for other targets than BOOK3S, rename it MMU_FTR_KUAP. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/c8b6f7b8cd0eeaace96879ed0e0a157faa619451.1689091022.git.christophe.leroy@csgroup.eu
2023-08-02powerpc/kuap: Fold kuep_is_disabled() into its only userChristophe Leroy
kuep_is_disabled() was introduced by commit 91bb30822a2e ("powerpc/32s: Refactor update of user segment registers") but then all users but one were removed by commit 526d4a4c77ae ("powerpc/32s: Do kuep_lock() and kuep_unlock() in assembly"). Fold kuep_is_disabled() into init_new_context() which is its only user. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/b2247147c0a8c830ac82966451647850df4a64da.1689091022.git.christophe.leroy@csgroup.eu
2023-07-28powerpc/mm/altmap: Fix altmap boundary checkAneesh Kumar K.V
altmap->free includes the entire free space from which altmap blocks can be allocated. So when checking whether the kernel is doing altmap block free, compute the boundary correctly, otherwise memory hotunplug can fail. Fixes: 9ef34630a461 ("powerpc/mm: Fallback to RAM if the altmap is unusable") Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230724181320.471386-1-aneesh.kumar@linux.ibm.com
2023-07-17powerpc/kasan: Disable KCOV in KASAN codeBenjamin Gray
As per the generic KASAN code in mm/kasan, disable KCOV with KCOV_INSTRUMENT := n in the makefile. This fixes a ppc64 boot hang when KCOV and KASAN are enabled. kasan_early_init() gets called before a PACA is initialised, but the KCOV hook expects a valid PACA. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230710044143.146840-1-bgray@linux.ibm.com
2023-07-10powerpc/64s: Fix native_hpte_remove() to be irq-safeMichael Ellerman
Lockdep warns that the use of the hpte_lock in native_hpte_remove() is not safe against an IRQ coming in: ================================ WARNING: inconsistent lock state 6.4.0-rc2-g0c54f4d30ecc #1 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. qemu-system-ppc/93865 [HC0[0]:SC0[0]:HE1:SE1] takes: c0000000021f5180 (hpte_lock){+.?.}-{0:0}, at: native_lock_hpte+0x8/0xd0 {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x134/0x3f0 native_lock_hpte+0x44/0xd0 native_hpte_insert+0xd4/0x2a0 __hash_page_64K+0x218/0x4f0 hash_page_mm+0x464/0x840 do_hash_fault+0x11c/0x260 data_access_common_virt+0x210/0x220 __ip_select_ident+0x140/0x150 ... net_rx_action+0x3bc/0x440 __do_softirq+0x180/0x534 ... sys_sendmmsg+0x34/0x50 system_call_exception+0x128/0x320 system_call_common+0x160/0x2e4 ... Possible unsafe locking scenario: CPU0 ---- lock(hpte_lock); <Interrupt> lock(hpte_lock); *** DEADLOCK *** ... Call Trace: dump_stack_lvl+0x98/0xe0 (unreliable) print_usage_bug.part.0+0x250/0x278 mark_lock+0xc9c/0xd30 __lock_acquire+0x440/0x1ca0 lock_acquire+0x134/0x3f0 native_lock_hpte+0x44/0xd0 native_hpte_remove+0xb0/0x190 kvmppc_mmu_map_page+0x650/0x698 [kvm_pr] kvmppc_handle_pagefault+0x534/0x6e8 [kvm_pr] kvmppc_handle_exit_pr+0x6d8/0xe90 [kvm_pr] after_sprg3_load+0x80/0x90 [kvm_pr] kvmppc_vcpu_run_pr+0x108/0x270 [kvm_pr] kvmppc_vcpu_run+0x34/0x48 [kvm] kvm_arch_vcpu_ioctl_run+0x340/0x470 [kvm] kvm_vcpu_ioctl+0x338/0x8b8 [kvm] sys_ioctl+0x7c4/0x13e0 system_call_exception+0x128/0x320 system_call_common+0x160/0x2e4 I suspect kvm_pr is the only caller that doesn't already have IRQs disabled, which is why this hasn't been reported previously. Fix it by disabling IRQs in native_hpte_remove(). Fixes: 35159b5717fa ("powerpc/64s: make HPTE lock and native_tlbie_lock irq-safe") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230517123033.18430-1-mpe@ellerman.id.au
2023-06-30Merge tag 'powerpc-6.5-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Extend KCSAN support to 32-bit and BookE. Add some KCSAN annotations - Make ELFv2 ABI the default for 64-bit big-endian kernel builds, and use the -mprofile-kernel option (kernel specific ftrace ABI) for big endian ELFv2 kernels - Add initial Dynamic Execution Control Register (DEXCR) support, and allow the ROP protection instructions to be used on Power 10 - Various other small features and fixes Thanks to Aditya Gupta, Aneesh Kumar K.V, Benjamin Gray, Brian King, Christophe Leroy, Colin Ian King, Dmitry Torokhov, Gaurav Batra, Jean Delvare, Joel Stanley, Marco Elver, Masahiro Yamada, Nageswara R Sastry, Nathan Chancellor, Naveen N Rao, Nayna Jain, Nicholas Piggin, Paul Gortmaker, Randy Dunlap, Rob Herring, Rohan McLure, Russell Currey, Sachin Sant, Timothy Pearson, Tom Rix, and Uwe Kleine-König. * tag 'powerpc-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (76 commits) powerpc: remove checks for binutils older than 2.25 powerpc: Fail build if using recordmcount with binutils v2.37 powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory powerpc/iommu: Only build sPAPR access functions on pSeries powerpc: powernv: Annotate data races in opal events powerpc: Mark writes registering ipi to host cpu through kvm and polling powerpc: Annotate accesses to ipi message flags powerpc: powernv: Fix KCSAN datarace warnings on idle_state contention powerpc: Mark [h]ssr_valid accesses in check_return_regs_valid powerpc: qspinlock: Enforce qnode writes prior to publishing to queue powerpc: qspinlock: Mark accesses to qnode lock checks powerpc/powernv/pci: Remove last IODA1 defines powerpc/powernv/pci: Remove MVE code powerpc/powernv/pci: Remove ioda1 support powerpc: 52xx: Make immr_id DT match tables static powerpc: mpc512x: Remove open coded "ranges" parsing powerpc: fsl_soc: Use of_range_to_resource() for "ranges" parsing powerpc: fsl: Use of_property_read_reg() to parse "reg" powerpc: fsl_rio: Use of_range_to_resource() for "ranges" parsing macintosh: Use of_property_read_reg() to parse "reg" ...
2023-06-28Merge branch 'expand-stack'Linus Torvalds
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout. It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking. And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward. That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops. It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently: - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike. - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong. - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case. None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store. So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it. Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages. And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds. That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP. So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's a obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stackls are special, because at execve time even they grow down". The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead od having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP. And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case). In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace(). Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around. Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal. Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates. Reported-by: Ruihan Li <lrh2000@pku.edu.cn> * branch 'expand-stack': gup: add warning if some caller would seem to want stack expansion mm: always expand the stack with the mmap write lock held execve: expand new process stack manually ahead of time mm: make find_extend_vma() fail if write lock not held powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma() mm/fault: convert remaining simple cases to lock_mm_and_find_vma() arm/mm: Convert to using lock_mm_and_find_vma() riscv/mm: Convert to using lock_mm_and_find_vma() mips/mm: Convert to using lock_mm_and_find_vma() powerpc/mm: Convert to using lock_mm_and_find_vma() arm64/mm: Convert to using lock_mm_and_find_vma() mm: make the page fault mmap locking killable mm: introduce new 'lock_mm_and_find_vma()' page fault helper