summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2024-11-06mm/vmstat: defer the refresh_zone_stat_thresholds after all CPUs bringupSaurabh Sengar
refresh_zone_stat_thresholds function has two loops which is expensive for higher number of CPUs and NUMA nodes. Below is the rough estimation of total iterations done by these loops based on number of NUMA and CPUs. Total number of iterations: nCPU * 2 * Numa * mCPU Where: nCPU = total number of CPUs Numa = total number of NUMA nodes mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs) For the system under test with 16 NUMA nodes and 1024 CPUs, this results in a substantial increase in the number of loop iterations during boot-up when NUMA is enabled: No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds takes around 224 ms total for all the CPUs in the system under test. 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds takes around 4.5 seconds total for all the CPUs in the system under test. Calling this for each CPU is expensive when there are large number of CPUs along with multiple NUMAs. Fix this by deferring refresh_zone_stat_thresholds to be called later at once when all the secondary CPUs are up. Also, register the DYN hooks to keep the existing hotplug functionality intact. Link: https://lkml.kernel.org/r/1723443220-20623-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu> Cc: Saurabh Singh Sengar <ssengar@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06vmscan: add a vmscan event for reclaim_pagesJaewon Kim
reclaim_folio_list uses a dummy reclaim_stat and is not being used. To know the memory stat, add a new trace event. This is useful how how many pages are not reclaimed or why. This is an example: mm_vmscan_reclaim_pages: nid=0 nr_scanned=112 nr_reclaimed=112 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=0 nr_unmap_fail=0 Currently reclaim_folio_list is only called by reclaim_pages, and reclaim_pages is used by damon and madvise. In the latest Android, reclaim_pages is also used by shmem to reclaim all pages in a address_space. [jaewon31.kim@samsung.com: use sc.nr_scanned rather than new counting] Link: https://lkml.kernel.org/r/20241016143227.961162-1-jaewon31.kim@samsung.com Link: https://lkml.kernel.org/r/20241011124928.1224813-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: avoid zeroing user movable page twice with init_on_alloc=1Zi Yan
Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") forces allocated page to be zeroed in post_alloc_hook() when init_on_alloc=1. For order-0 folios, if arch does not define vma_alloc_zeroed_movable_folio(), the default implementation again zeros the page return from the buddy allocator. So the page is zeroed twice. Fix it by passing __GFP_ZERO instead to avoid double page zeroing. At the moment, s390,arm64,x86,alpha,m68k are not impacted since they define their own vma_alloc_zeroed_movable_folio(). For >0 order folios (mTHP and PMD THP), folio_zero_user() is called to zero the folio again. Fix it by calling folio_zero_user() only if init_on_alloc is set. All arch are impacted. Add alloc_zeroed() helper to encapsulate the init_on_alloc check. [ziy@nvidia.com: comment fixes, per David] Link: https://lkml.kernel.org/r/97DB52E1-C594-49B5-9736-89AC302FAB01@nvidia.com Link: https://lkml.kernel.org/r/20241011150304.709590-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/zswap: avoid touching XArray for unnecessary invalidationKairui Song
zswap_invalidation simply calls xa_erase, which acquires the Xarray lock first, then does a look up. This has a higher overhead even if zswap is not used or the tree is empty. So instead, do a very lightweight xa_empty check first, if there is nothing to erase, don't touch the lock or the tree. Using xa_empty rather than zswap_never_enabled is more helpful as it cover both case where zswap wes never used or the particular range doesn't have any zswap entry. And it's safe as the swap slot should be currently pinned by caller with HAS_CACHE. Sequential SWAP in/out tests with zswap disabled showed a minor performance gain, SWAP in of zero page with zswap enabled also showed a performance gain. (swapout is basically unchanged so only test one case): Swapout of 2G zero page using brd as SWAP, zswap disabled (total time, 4 testrun, +0.1%): Before: 1705013 us 1703119 us 1704335 us 1705848 us. After: 1703579 us 1710640 us 1703625 us 1708699 us. Swapin of 2G zero page using brd as SWAP, zswap disabled (total time, 4 testrun, -3.5%): Before: 1912312 us 1915692 us 1905837 us 1912706 us. After: 1845354 us 1849691 us 1845868 us 1841828 us. Swapin of 2G zero page using brd as SWAP, zswap enabled (total time, 4 testrun, -3.3%): Before: 1897994 us 1894681 us 1899982 us 1898333 us After: 1835894 us 1834113 us 1832047 us 1833125 us Swapin of 2G random page using brd as SWAP, zswap enabled (total time, 4 testrun, -0.1%): Before: 4519747 us 4431078 us 4430185 us 4439999 us After: 4492176 us 4437796 us 4434612 us 4434289 us And the performance is very slightly better or unchanged for build kernel test with zswap enabled or disabled. Build Linux Kernel with defconfig and -j32 in 1G memory cgroup, using brd SWAP, zswap disabled (sys time in seconds, 6 testrun, -0.1%): Before: 1648.83 1653.52 1666.34 1665.95 1663.06 1656.67 After: 1651.36 1661.89 1645.70 1657.45 1662.07 1652.83 Build Linux Kernel with defconfig and -j32 in 2G memory cgroup, using brd SWAP zswap enabled (sys time in seconds, 6 testrun, -0.3%): Before: 1240.25 1254.06 1246.77 1265.92 1244.23 1227.74 After: 1226.41 1218.21 1249.12 1249.13 1244.39 1233.01 Link: https://lkml.kernel.org/r/20241011171950.62684-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/hugetlb: perform vmemmap optimization batchly for specific node allocationsuhua
When HVO is enabled and huge page memory allocs are made, the freed memory can be aggregated into higher order memory in the following paths, which facilitates further allocs for higher order memory. echo 200000 > /proc/sys/vm/nr_hugepages echo 200000 > /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages grub default_hugepagesz=2M hugepagesz=2M hugepages=200000 Currently not support for releasing aggregations to higher order in the following way, which will releasing to lower order. grub: default_hugepagesz=2M hugepagesz=2M hugepages=0:100000,1:100000 This patch supports the release of huge page optimizations aggregates to higher order memory. eg: cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-xxx ... default_hugepagesz=2M hugepagesz=2M hugepages=0:100000,1:100000 Before: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Unmovable 55282 97039 99307 0 1 1 0 1 1 1 0 Node 0, zone Normal, type Movable 25 11 345 87 48 21 2 20 9 3 75061 Node 0, zone Normal, type Reclaimable 4 2 2 4 3 0 2 1 1 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 ... Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 1, zone Normal, type Unmovable 98888 99650 99679 2 3 1 2 2 2 0 0 Node 1, zone Normal, type Movable 1 1 0 1 1 0 1 0 1 1 75937 Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 After: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Unmovable 152 158 37 2 2 0 3 4 2 6 717 Node 0, zone Normal, type Movable 1 37 53 3 55 49 16 6 2 1 75000 Node 0, zone Normal, type Reclaimable 1 4 3 1 2 1 1 1 1 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 ... Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 1, zone Normal, type Unmovable 5 3 2 1 3 4 2 2 2 0 779 Node 1, zone Normal, type Movable 1 0 1 1 1 0 1 0 1 1 75849 Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Link: https://lkml.kernel.org/r/20241012070802.1876-1-suhua1@kingsoft.com Signed-off-by: suhua <suhua1@kingsoft.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg: add tracing for memcg stat updatesShakeel Butt
The memcg stats are maintained in rstat infrastructure which provides very fast updates side and reasonable read side. However memcg added plethora of stats and made the read side, which is cgroup rstat flush, very slow. To solve that, threshold was added in the memcg stats read side i.e. no need to flush the stats if updates are within the threshold. This threshold based improvement worked for sometime but more stats were added to memcg and also the read codepath was getting triggered in the performance sensitive paths which made threshold based ratelimiting ineffective. We need more visibility into the hot and cold stats i.e. stats with a lot of updates. Let's add trace to get that visibility. [shakeel.butt@linux.dev: use unsigned long type for memcg_rstat_events, per Yosry] Link: https://lkml.kernel.org/r/20241015213721.3804209-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20241010003550.3695245-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: JP Kobryn <inwardvessel@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove unused hugepage for vma_alloc_folio()Kefeng Wang
The hugepage parameter was deprecated since commit ddc1a5cbc05d ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"), for PMD-sized THP, it still tries only preferred node if possible in vma_alloc_folio() by checking the order of the folio allocation. Link: https://lkml.kernel.org/r/20241010061556.1846751-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: add pcp high_min high_max to proc zoneinfoMengEn Sun
When we do not set percpu_pagelist_high_fraction the kernel will compute the pcp high_min/max by itself, which makes it hard to determine the current high_min/max values. So output the pcp high_min/max values to /proc/zoneinfo. Link: https://lkml.kernel.org/r/20241010120935.656619-1-mengensun@tencent.com Signed-off-by: MengEn Sun <mengensun@tencent.com> Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/kmemleak: fix typo in object_no_scan() commentMike Rapoport (Microsoft)
Replace "corresponding to the give pointer" with "corresponding to the given pointer" Link: https://lkml.kernel.org/r/20241010155439.554416-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_endJohn Hubbard
For clarity. It's increasingly hard to reason about the code, when KASLR is moving around the boundaries. In this case where KASLR is randomizing the location of the kernel image within physical memory, the maximum number of address bits for physical memory has not changed. What has changed is the ending address of memory that is allowed to be directly mapped by the kernel. Let's name the variable, and the associated macro accordingly. Also, enhance the comment above the direct_map_physmem_end definition, to further clarify how this all works. Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jordan Niethe <jniethe@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: allocate THP on hugezeropage wp-faultDev Jain
Introduce do_huge_zero_wp_pmd() to handle wp-fault on a hugezeropage and replace it with a PMD-mapped THP. Remember to flush TLB entry corresponding to the hugezeropage. In case of failure, fallback to splitting the PMD. Link: https://lkml.kernel.org/r/20241008061746.285961-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: abstract THP allocationDev Jain
Patch series "Do not shatter hugezeropage on wp-fault", v7. It was observed at [1] and [2] that the current kernel behaviour of shattering a hugezeropage is inconsistent and suboptimal. For a VMA with a THP allowable order, when we write-fault on it, the kernel installs a PMD-mapped THP. On the other hand, if we first get a read fault, we get a PMD pointing to the hugezeropage; subsequent write will trigger a write-protection fault, shattering the hugezeropage into one writable page, and all the other PTEs write-protected. The conclusion being, as compared to the case of a single write-fault, applications have to suffer 512 extra page faults if they were to use the VMA as such, plus we get the overhead of khugepaged trying to replace that area with a THP anyway. Instead, replace the hugezeropage with a THP on wp-fault. [1]: https://lore.kernel.org/all/3743d7e1-0b79-4eaf-82d5-d1ca29fe347d@arm.com/ [2]: https://lore.kernel.org/all/1cfae0c0-96a2-4308-9c62-f7a640520242@arm.com/ This patch (of 2): In preparation for the second patch, abstract away the THP allocation logic present in the create_huge_pmd() path, which corresponds to the faulting case when no page is present. There should be no functional change as a result of applying this patch, except that, as David notes at [1], a PMD-aligned address should be passed to update_mmu_cache_pmd(). [1]: https://lore.kernel.org/all/ddd3fcd2-48b3-4170-bcaa-2fe66e093f43@redhat.com/ Link: https://lkml.kernel.org/r/20241008061746.285961-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20241008061746.285961-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/memory.c: remove stray newline at top of fileAndrew Morton
Fixes: d61ea1cb0095 ("userfaultfd: UFFD_FEATURE_WP_ASYNC") Reported-by: Jeongjun Park <aha310510@gmail.com> Closes: https://lkml.kernel.org/r/20241007065307.4158-1-aha310510@gmail.com Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06percpu: fix data race with pcpu_nr_empty_pop_pagesDennis Zhou
Fixes the data race by moving the read to be behind the pcpu_lock. This is okay because the code (initially) above it will not increase the empty populated page count because it is populating backing pages that already have allocations served out of them. Link: https://lkml.kernel.org/r/20241008001942.8114-1-dennis@kernel.org Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407191651.f24e499d-oliver.sang@intel.com Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/mmap: teach generic_get_unmapped_area{_topdown} to handle hugetlb mappingsOscar Salvador
Patch series "Unify hugetlb into arch_get_unmapped_area functions", v4. This is an attempt to get rid of a fair amount of duplicated code wrt. hugetlb and *get_unmapped_area* functions. HugeTLB registers a .get_unmapped_area function which gets called from __get_unmapped_area(). hugetlb_get_unmapped_area() is defined by a bunch of architectures and it also has a generic definition for those that do not define it. Short-long story is that there is a ton of duplicated code between specific hugetlb *_get_unmapped_area_* functions and mm-core functions, so we can do better by teaching arch_get_unmapped_area* functions how to deal with hugetlb mappings. Note that not a lot of things need to be taught though. hugetlb_get_unmapped_area, that gets called for hugetlb mappings, runs some sanity checks prior to calling mm_get_unmapped_area_vmflags(), so we do not need to that down the road in the respective {generic,arch}_get_unmapped_area* functions. More information can be found in the respective patches. LTP mmapstress hugetlb selftests were ran succesfully on: This patch (of 9): We want to stop special casing hugetlb mappings and make them go through generic channels, so teach generic_get_unmapped_area{_topdown} to handle those. The main difference is that we set info.align_mask for huge mappings. Link: https://lkml.kernel.org/r/20241007075037.267650-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20241007075037.267650-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove misleading 'unlikely' hint in vms_gather_munmap_vmas()Breno Leitao
Performance analysis using branch annotation on a fleet of 200 hosts running web servers revealed that the 'unlikely' hint in vms_gather_munmap_vmas() was 100% consistently incorrect. In all observed cases, the branch behavior contradicted the hint. Remove the 'unlikely' qualifier from the condition checking 'vms->uf'. By doing so, we allow the compiler to make optimization decisions based on its own heuristics and profiling data, rather than relying on a static hint that has proven to be inaccurate in real-world scenarios. Link: https://lkml.kernel.org/r/20241004164832.218681-1-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/truncate: reset xa_has_values flag on each iterationShakeel Butt
Currently mapping_try_invalidate() and invalidate_inode_pages2_range() traverses the xarray in batches and then for each batch, maintains and sets the flag named xa_has_values if the batch has a shadow entry to clear the entries at the end of the iteration. However they forgot to reset the flag at the end of the iteration which causes them to always try to clear the shadow entries in the subsequent iterations where there might not be any shadow entries. Fix this inefficiency. Link: https://lkml.kernel.org/r/20241002225150.2334504-1-shakeel.butt@linux.dev Fixes: 61c663e020d2 ("mm/truncate: batch-clear shadow entries") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: swap: make some count_mthp_stat() call-sites be THP-agnostic.Kanchana P Sridhar
In commit 246d3aa3e531 ("mm: cleanup count_mthp_stat() definition"), Ryan Roberts has pointed out the merits of mm code that does not require THP, to be compile-able without requiring THP ifdefs. As a step in that direction, he has moved count_mthp_stat() to be always defined, resolving to a no-op if THP is not defined. Barry Song referred me to Ryan's commit when I was working on the "mm: zswap swap-out of large folios" patch-series [1]. This patch propagates the benefits of the above change to page_io.c and vmscan.c. As a result, there is one less reason to have the ifdef THP in these code sections. [1]: https://patchwork.kernel.org/project/linux-mm/list/?series=894347 Link: https://lkml.kernel.org/r/20241002225822.9006-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06gup: convert FOLL_TOUCH case in follow_page_pte() to folioMatthew Wilcox (Oracle)
We already have the folio here, so just use it, removing three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20241002151403.1345296-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove PageKsm()Matthew Wilcox (Oracle)
All callers have been converted to use folio_test_ksm() or PageAnonNotKsm(), so we can remove this wrapper. Link: https://lkml.kernel.org/r/20241002152533.1350629-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06ksm: convert should_skip_rmap_item() to take a folioMatthew Wilcox (Oracle)
Remove a call to PageKSM() by passing the folio containing tmp_page to should_skip_rmap_item. Removes a hidden call to compound_head(). Link: https://lkml.kernel.org/r/20241002152533.1350629-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06ksm: convert cmp_and_merge_page() to use a folioMatthew Wilcox (Oracle)
By making try_to_merge_two_pages() and stable_tree_search() return a folio, we can replace kpage with kfolio. This replaces 7 calls to compound_head() with one. [cuigaosheng1@huawei.com: add IS_ERR_OR_NULL check for stable_tree_search()] Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Link: https://lkml.kernel.org/r/20241002152533.1350629-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06ksm: use a folio in try_to_merge_one_page()Matthew Wilcox (Oracle)
Patch series "Remove PageKsm()". The KSM flag is almost always tested on the folio rather than on the page. This series removes the final users of PageKsm() and makes the flag only This patch (of 5): It is safe to use a folio here because all callers took a refcount on this page. The one wrinkle is that we have to recalculate the value of folio after splitting the page, since it has probably changed. Replaces nine calls to compound_head() with one. Link: https://lkml.kernel.org/r/20241002152533.1350629-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241002152533.1350629-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06tmpfs: Initialize sysfs during tmpfs initAndré Almeida
Instead of using fs_initcall(), initialize sysfs with the rest of the filesystem. This is the right way to do it because otherwise any error during tmpfs_sysfs_init() would get silently ignored. It's also useful if tmpfs' sysfs ever need to display runtime information. Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-4-andrealmeid@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06tmpfs: Fix type for sysfs' casefold attributeAndré Almeida
DEVICE_STRING_ATTR_RO should be only used by device drivers since it relies on `struct device` to use device_show_string() function. Using this with non device code led to a kCFI violation: > cat /sys/fs/tmpfs/features/casefold [ 70.558496] CFI failure at kobj_attr_show+0x2c/0x4c (target: device_show_string+0x0/0x38; expected type: 0xc527b809) Like the other filesystems, fix this by manually declaring the attribute using kobj_attribute() and writing a proper show() function. Also, leave macros for anyone that need to expand tmpfs sysfs' with more attributes. Fixes: 5132f08bd332 ("tmpfs: Expose filesystem features via sysfs") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/lkml/20241031051822.GA2947788@thelio-3990X/ Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-3-andrealmeid@igalia.com Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05mm/util: deduplicate code in {kstrdup,kstrndup,kmemdup_nul}Yafang Shao
These three functions follow the same pattern. To deduplicate the code, let's introduce a common helper __kmemdup_nul(). Link: https://lkml.kernel.org/r/20241007144911.27693-7-laoar.shao@gmail.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Simon Horman <horms@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alejandro Colomar <alx@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Jan Kara <jack@suse.cz> Cc: Justin Stitt <justinstitt@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matus Jokay <matus.jokay@stuba.sk> Cc: Maxime Ripard <mripard@kernel.org> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Quentin Monnet <qmo@kernel.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/util: fix possible race condition in kstrdup()Yafang Shao
In kstrdup(), it is critical to ensure that the dest string is always NUL-terminated. However, potential race condition can occur between a writer and a reader. Consider the following scenario involving task->comm: reader writer len = strlen(s) + 1; strlcpy(tsk->comm, buf, sizeof(tsk->comm)); memcpy(buf, s, len); In this case, there is a race condition between the reader and the writer. The reader calculates the length of the string `s` based on the old value of task->comm. However, during the memcpy(), the string `s` might be updated by the writer to a new value of task->comm. If the new task->comm is larger than the old one, the `buf` might not be NUL-terminated. This can lead to undefined behavior and potential security vulnerabilities. Let's fix it by explicitly adding a NUL terminator after the memcpy. It is worth noting that memcpy() is not atomic, so the new string can be shorter when memcpy() already copied past the new NUL. Link: https://lkml.kernel.org/r/20241007144911.27693-6-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Alejandro Colomar <alx@kernel.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Jan Kara <jack@suse.cz> Cc: Justin Stitt <justinstitt@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matus Jokay <matus.jokay@stuba.sk> Cc: Maxime Ripard <mripard@kernel.org> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Quentin Monnet <qmo@kernel.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Simon Horman <horms@kernel.org> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/cma: fix useless return in void functionPintu Kumar
There is a unnecessary return statement at the end of void function cma_activate_area. This can be dropped. While at it, also fix another warning related to unsigned. These are reported by checkpatch as well. WARNING: Prefer 'unsigned int' to bare use of 'unsigned' +unsigned cma_area_count; WARNING: void function return statements are not generally useful + return; +} Link: https://lkml.kernel.org/r/20240927181637.19941-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: optimize invalidation of shadow entriesShakeel Butt
The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For each batch, it traverses the page cache tree and collects the entries (folio and shadow entries) in the struct folio_batch. For the shadow entries present in the folio_batch, it has to traverse the page cache tree for each individual entry to remove them. This patch optimize this by removing them in a single tree traversal. To evaluate the changes, we created 200GiB file on a fuse fs and in a memcg. We created the shadow entries by triggering reclaim through memory.reclaim in that specific memcg and measure the simple fadvise(DONTNEED) operation. # time xfs_io -c 'fadvise -d 0 ${file_size}' file time (sec) Without 5.12 +- 0.061 With-patch 4.19 +- 0.086 (18.16% decrease) Link: https://lkml.kernel.org/r/20240925224716.2904498-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chris Mason <clm@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: optimize truncation of shadow entriesShakeel Butt
Patch series "mm: optimize shadow entries removal", v2. Some of our production workloads which processes a large amount of data spends considerable amount of CPUs on truncation and invalidation of large sized files (100s of GiBs of size). Tracing the operations showed that most of the time is in shadow entries removal. This patch series optimizes the truncation and invalidation operations. This patch (of 2): The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each batch, it traverses the page cache tree and collects the entries (folio and shadow entries) in the struct folio_batch. For the shadow entries present in the folio_batch, it has to traverse the page cache tree for each individual entry to remove them. This patch optimize this by removing them in a single tree traversal. On large machines in our production which run workloads manipulating large amount of data, we have observed that a large amount of CPUs are spent on truncation of very large files (100s of GiBs file sizes). More specifically most of time was spent on shadow entries cleanup, so optimizing the shadow entries cleanup, even a little bit, has good impact. To evaluate the changes, we created 200GiB file on a fuse fs and in a memcg. We created the shadow entries by triggering reclaim through memory.reclaim in that specific memcg and measure the simple truncation operation. # time truncate -s 0 file time (sec) Without 5.164 +- 0.059 With-patch 4.21 +- 0.066 (18.47% decrease) Link: https://lkml.kernel.org/r/20240925224716.2904498-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240925224716.2904498-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Mason <clm@fb.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: migrate LRU_REFS_MASK bits in folio_migrate_flagsZhaoyang Huang
Bits of LRU_REFS_MASK are not inherited during migration which lead to new folio start from tier0 when MGLRU enabled. Try to bring as much bits of folio->flags as possible since compaction and alloc_contig_range which introduce migration do happen at times. Link: https://lkml.kernel.org/r/20240926050647.5653-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: pgtable: remove pte_offset_map_nolock()Qi Zheng
Now no users are using the pte_offset_map_nolock(), remove it. Link: https://lkml.kernel.org/r/d04f9bbbcde048fb6ffa6f2bdbc6f9b22d5286f9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()Qi Zheng
In walk_pte_range(), we may modify the pte entry after holding the ptl, so convert it to using pte_offset_map_rw_nolock(). At this time, the pte_same() check is not performed after the ptl held, so we should get pmdval and do pmd_same() check to ensure the stability of pmd entry. Link: https://lkml.kernel.org/r/7e9c194a5efacc9609cfd31abb9c7df88b53b530.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()Qi Zheng
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we will use pte_same() to detect the change of the pte entry, there is no need to get pmdval, so just pass a dummy variable to it. Link: https://lkml.kernel.org/r/1530e8fdbfc72eacf3b095babe139ce3d715600a.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()Qi Zheng
In the caller of map_pte(), we may modify the pvmw->pte after acquiring the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At this time, the pte_same() check is not performed after the pvmw->ptl held, so we should get pmdval and do pmd_same() check to ensure the stability of pvmw->pmd. Link: https://lkml.kernel.org/r/2620a48f34c9f19864ab0169cdbf253d31a8fcaa.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: mremap: move_ptes() use pte_offset_map_rw_nolock()Qi Zheng
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so convert it to using pte_offset_map_rw_nolock(). Now new_pte is none, so hpage_collapse_scan_file() path can not find this by traversing file->f_mapping, so there is no concurrency with retract_page_tables(). In addition, we already hold the exclusive mmap_lock, so this new_pte page is stable, so there is no need to get pmdval and do pmd_same() check. Link: https://lkml.kernel.org/r/9d582a09dbcf12e562ac5fe0ba05e9248a58f5e0.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: copy_pte_range() use pte_offset_map_rw_nolock()Qi Zheng
In copy_pte_range(), we may modify the src_pte entry after holding the src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we already hold the exclusive mmap_lock, and the copy_pte_range() and retract_page_tables() are using vma->anon_vma to be exclusive, so the PTE page is stable, there is no need to get pmdval and do pmd_same() check. Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()Qi Zheng
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At this time, the pte_same() check is not performed after the PTL held. So we should get pgt_pmd and do pmd_same() check after the ptl held. Link: https://lkml.kernel.org/r/055e42db68da00ac8ecab94bd2633c7cd965eb1c.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: handle_pte_fault() use pte_offset_map_rw_nolock()Qi Zheng
In handle_pte_fault(), we may modify the vmf->pte after acquiring the vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we will do the pte_same() check, so there is no need to get pmdval to do pmd_same() check, just pass a dummy variable to it. Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()Qi Zheng
In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check in do_swap_page(). In other places, we directly use pte_offset_map_lock(), so convert it to using pte_offset_map_ro_nolock(). Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()Qi Zheng
In filemap_fault_recheck_pte_none(), we just do pte_none() check, so convert it to using pte_offset_map_ro_nolock(). Link: https://lkml.kernel.org/r/9f7cbbaa772385ced1b8931b67a8b9d246c9b82d.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()Qi Zheng
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5. As proposed by David Hildenbrand [1], this series introduces the following two new helper functions to replace pte_offset_map_nolock(). 1. pte_offset_map_ro_nolock() 2. pte_offset_map_rw_nolock() As the name suggests, pte_offset_map_ro_nolock() is used for read-only case. In this case, only read-only operations will be performed on PTE page after the PTL is held. The RCU lock in pte_offset_map_nolock() will ensure that the PTE page will not be freed, and there is no need to worry about whether the pmd entry is modified. Therefore pte_offset_map_ro_nolock() is just a renamed version of pte_offset_map_nolock(). pte_offset_map_rw_nolock() is used for may-write case. In this case, the pte or pmd entry may be modified after the PTL is held, so we need to ensure that the pmd entry has not been modified concurrently. So in addition to the name change, it also outputs the pmdval when successful. The users should make sure the page table is stable like checking pte_same() or checking pmd_same() by using the output pmdval before performing the write operations. This series will convert all pte_offset_map_nolock() into the above two helper functions one by one, and finally completely delete it. This also a preparation for reclaiming the empty user PTE page table pages. This patch (of 13): Currently, the usage of pte_offset_map_nolock() can be divided into the following two cases: 1) After acquiring PTL, only read-only operations are performed on the PTE page. In this case, the RCU lock in pte_offset_map_nolock() will ensure that the PTE page will not be freed, and there is no need to worry about whether the pmd entry is modified. 2) After acquiring PTL, the pte or pmd entries may be modified. At this time, we need to ensure that the pmd entry has not been modified concurrently. To more clearing distinguish between these two cases, this commit introduces two new helper functions to replace pte_offset_map_nolock(). For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition to changing the name to pte_offset_map_rw_nolock(), it also outputs the pmdval when successful. It is applicable for may-write cases where any modification operations to the page table may happen after the corresponding spinlock is held afterwards. But the users should make sure the page table is stable like checking pte_same() or checking pmd_same() by using the output pmdval before performing the write operations. Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will be read-only/read-write protected. Subsequent commits will convert pte_offset_map_nolock() into the above two functions one by one, and finally completely delete it. Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: move mm flags to mm_types.hNanyong Sun
The types of mm flags are now far beyond the core dump related features. This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h. The linux/sched/coredump.h has include the mm_types.h, so the C files related to coredump does not need to change head file inclusion. In addition, the inclusion of sched/coredump.h now can be deleted from the C files that irrelevant to core dump. Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/madvise: unrestrict process_madvise() for current processLorenzo Stoakes
The process_madvise() call was introduced in commit ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API") as a means of performing madvise() operations on another process. However, as it provides the means by which to perform multiple madvise() operations in a batch via an iovec, it is useful to utilise the same interface for performing operations on the current process rather than a remote one. Commit 22af8caff7d1 ("mm/madvise: process_madvise() drop capability check if same mm") removed the need for a caller invoking process_madvise() on its own pidfd to possess the CAP_SYS_NICE capability, however this leaves the restrictions on operation in place. Resolve this by only applying the restriction on operations when accessing a remote process. Moving forward we plan to implement a simpler means of specifying this condition other than needing to establish a self pidfd, perhaps in the form of a sentinel pidfd. Also take the opportunity to refactor the system call implementation abstracting the vectorised operation. Link: https://lkml.kernel.org/r/20240926151019.82902-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/mempolicy: fix comments for better documentationTanya Agarwal
Fix typo in mempolicy.h and Correct the number of allowed memory policy Link: https://lkml.kernel.org/r/20240926183516.4034-2-tanyaagarwal25699@gmail.com Signed-off-by: Tanya Agarwal <tanyaagarwal25699@gmail.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Anup Sharma <anupnewsmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: fix shrink nr.unqueued_dirty counter issueZhiguo Jiang
It is needed to ensure sc->nr.unqueued_dirty > 0, which can avoid setting PGDAT_DIRTY flag when sc->nr.unqueued_dirty and sc->nr.file_taken are both zero. Link: https://lkml.kernel.org/r/20240112012353.1387-1-justinjiang@vivo.com Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: refactor mm_access() to not return NULLLorenzo Stoakes
mm_access() can return NULL if the mm is not found, but this is handled the same as an error in all callers, with some translating this into an -ESRCH error. Only proc_mem_open() returns NULL if no mm is found, however in this case it is clearer and makes more sense to explicitly handle the error. Additionally we take the opportunity to refactor the function to eliminate unnecessary nesting. Simplify things by simply returning -ESRCH if no mm is found - this both eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies callers which would return -ESRCH by returning this error directly. [lorenzo.stoakes@oracle.com: prefer neater pointer error comparison] Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address ↵Adrian Huang
into one operation When compiling kernel source 'make -j $(nproc)' with the up-and-running KASAN-enabled kernel on a 256-core machine, the following soft lockup is shown: watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760] CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95 Workqueue: events drain_vmap_area_work RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0 Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75 RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202 RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949 RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50 RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800 R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39 R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0 Call Trace: <IRQ> ? watchdog_timer_fn+0x2cd/0x390 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x300/0x6d0 ? sched_clock_cpu+0x69/0x4e0 ? __pfx___hrtimer_run_queues+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_update_offsets_now+0x7f/0x2a0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? hrtimer_interrupt+0x2ca/0x760 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0 ? sysvec_apic_timer_interrupt+0x6a/0x90 </IRQ> <TASK> ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? smp_call_function_many_cond+0x1d8/0xbb0 ? __pfx_do_kernel_range_flush+0x10/0x10 on_each_cpu_cond_mask+0x20/0x40 flush_tlb_kernel_range+0x19b/0x250 ? srso_return_thunk+0x5/0x5f ? kasan_release_vmalloc+0xa7/0xc0 purge_vmap_node+0x357/0x820 ? __pfx_purge_vmap_node+0x10/0x10 __purge_vmap_area_lazy+0x5b8/0xa10 drain_vmap_area_work+0x21/0x30 process_one_work+0x661/0x10b0 worker_thread+0x844/0x10e0 ? srso_return_thunk+0x5/0x5f ? __kthread_parkme+0x82/0x140 ? __pfx_worker_thread+0x10/0x10 kthread+0x2a5/0x370 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Debugging Analysis: 1. The following ftrace log shows that the lockup CPU spends too much time iterating vmap_nodes and flushing TLB when purging vm_area structures. (Some info is trimmed). kworker: funcgraph_entry: | drain_vmap_area_work() { kworker: funcgraph_entry: | mutex_lock() { kworker: funcgraph_entry: 1.092 us | __cond_resched(); kworker: funcgraph_exit: 3.306 us | } ... ... kworker: funcgraph_entry: | flush_tlb_kernel_range() { ... ... kworker: funcgraph_exit: # 7533.649 us | } ... ... kworker: funcgraph_entry: 2.344 us | mutex_unlock(); kworker: funcgraph_exit: $ 23871554 us | } The drain_vmap_area_work() spends over 23 seconds. There are 2805 flush_tlb_kernel_range() calls in the ftrace log. * One is called in __purge_vmap_area_lazy(). * Others are called by purge_vmap_node->kasan_release_vmalloc. purge_vmap_node() iteratively releases kasan vmalloc allocations and flushes TLB for each vmap_area. - [Rough calculation] Each flush_tlb_kernel_range() runs about 7.5ms. -- 2804 * 7.5ms = 21.03 seconds. -- That's why a soft lock is triggered. 2. Extending the soft lockup time can work around the issue (For example, # echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the above-mentioned speculation: drain_vmap_area_work() spends too much time. If we combine all TLB flush operations of the KASAN shadow virtual address into one operation in the call path 'purge_vmap_node()->kasan_release_vmalloc()', the running time of drain_vmap_area_work() can be saved greatly. The idea is from the flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the soft lockup won't be triggered. Here is the test result based on 6.10: [6.10 wo/ the patch] 1. ftrace latency profiling (record a trace if the latency > 20s). echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function echo function_graph > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/kernel/debug/tracing/tracing_on 2. Run `make -j $(nproc)` to compile the kernel source 3. Once the soft lockup is reproduced, check the ftrace log: cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 76) $ 50412985 us | } /* __purge_vmap_area_lazy */ 76) $ 50412997 us | } /* drain_vmap_area_work */ 76) $ 29165911 us | } /* __purge_vmap_area_lazy */ 76) $ 29165926 us | } /* drain_vmap_area_work */ 91) $ 53629423 us | } /* __purge_vmap_area_lazy */ 91) $ 53629434 us | } /* drain_vmap_area_work */ 91) $ 28121014 us | } /* __purge_vmap_area_lazy */ 91) $ 28121026 us | } /* drain_vmap_area_work */ [6.10 w/ the patch] 1. Repeat step 1-2 in "[6.10 wo/ the patch]" 2. The soft lockup is not triggered and ftrace log is empty. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace log. 4. Setting 'tracing_thresh' to 1 second gets ftrace log. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 23) $ 1074942 us | } /* __purge_vmap_area_lazy */ 23) $ 1074950 us | } /* drain_vmap_area_work */ The worst execution time of drain_vmap_area_work() is about 1 second. Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/ Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/memcontrol: add per-memcg pgpgin/pswpin counterJingxiang Zeng
In proactive memory reclamation scenarios, it is necessary to estimate the pswpin and pswpout metrics of the cgroup to determine whether to continue reclaiming anonymous pages in the current batch. This patch will collect these metrics and expose them. [linuszeng@tencent.com: v2] Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Li nk: https://lkml.kernel.org/r/20240913084453.3605621-1-jingxiangzeng.cas@gmail.com Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/damon: fix sparse warning for zero initializerLeo Stone
sparse warns about zero initializing an array with {0,}, change it to the equivalent {0}. Fixes the sparse warning: mm/damon/tests/vaddr-kunit.h:69:47: warning: missing braces around initializer Link: https://lkml.kernel.org/r/xriwklcwjpwcz7eiavo6f7envdar4jychhsk6sfkj5klaznb6b@j6vrvr2sxjht Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: Leo Stone <leocstone@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>